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Chapter 1 



Preamble 



This manuscript originates from an attempt to address the so-called "signal-to-symbol 
barrier." Perceptual agents, from plants to humans, perform measurements of physi- 
cal processes ("signals") at a level of granularity that is essentially continuous. 1 They 
also perform actions in the continuum of physical space. And yet, cognitive science, 
primary epistemics, and in general modern philosophy, associate "intelligent behavior" 
with some kind of "internal representation" made of discrete symbols (e.g. "concepts", 
"ideas", "objects", "categories") that can be manipulated with the tools of logic or 
probabilistic inference. But little is known about why such a "signal-to-symbol" con- 
version should occur, how it would yield an evolutionary advantage, or what principles 
should guide it. 

Traditional Information Theory and Statistical Decision Theory suggest that it may 
be counter-productive: If we consider biological systems as machines that perform ac- 
tions or make decisions in response to stimuli in a way that maximizes some decision 
or control objective, then the Data Processing Inequality 2 indicates that the best possi- 
ble agents would avoid data analysis, 3 i.e. , the process of breaking down the signals 
into discrete entities, or for that matter any kind of intermediate decision unrelated to 
the final task, as would instead be necessary to have a discrete internal representation 
made of symbols. 4 These considerations apply regardless of the specific control or de- 



GREEN FOR SIGNAL PROCESSING AND INFORMATION THE- 
ORY 



! The continuum is an abstraction, so here "continuous" is to be understood as existing at a level of gran- 
ularity significantly finer than the resolution of the measurement device or actuator. For instance, although 
retinal photoreceptors are finite in number, we do not perceive discontinuities due to retinal sampling. Scal- 
ing phenomena and the scale-invariance of natural image statistics make the continuum limit relevant to 
visual perception, unlike other sensory modalities. 

2 See for instance page 88 of [165], or Section 2.4.1. 

3 Note that I refer to data analysis as the process of "breaking down the data into pieces" (cfr. gr. 
analyein), i.e. the generally lossy conversion of data into discrete entities (symbols). This is not the case for 
global representations such as Fourier or wavelet decompositions, or principal component analysis (PCA), 
that are unfortunately often referred to as "analysis" because developed in the context of harmonic analysis, 
a branch of mathematics. The Fourier transform is globally invertible, which implies that there is no loss of 
data, and PCA consists in linear projections onto subspaces. 

4 Discretization is often advocated on complexity grounds, but complexity calls for data compression, 
not necessarily for data analysis. Any complexity cost could be added to the decision or control functional 
in Section 2.4.1, and the best decision would still avoid data analysis. For instance, to simplify a segment 
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cision task, from the simplest binary decision (e.g. "is a specific object present in the 
scene?") to the most complex conceivable task (e.g. "survival"). 

So, why would we need, or even benefit from, an internal representation? Is "intel- 
ligence" not possible in an analog setting? Or is data analysis necessary for cognition? 
If so, what would be the mathematical and computational principles that guide it? 

In the fields of Signal Processing, Image Processing, and Computer Vision, we rou- 
tinely perform pre-processing of data (filtering, sampling, anti-aliasing, edge detection, 
feature selection, segmentation, etc.), seemingly against the basic tenets of Information 
and Decision Theory. The latter would instead suggest an approach whereby images 
are fed directly into a "black-box" decision or control machine designed to optimize 
a (possibly very complex) cost functional. Such an approach would outperform one 
based on a generic, task-independent "internal representation." Should we then discard 
decades of work in attempting to infer such internal representations? 

Of course, one could argue that data analysis in biological systems is not guided 
by any principle, but an accident due to the constraints imposed by biological hard- 
ware, as in [190], where Turing showed that (continuous) reaction-diffusion partial 
differential equations (PDEs) that govern ion concentrations in neurons exhibit dis- 
crete/discontinuous solutions. This is why the signal is encoded in discrete "spikes," 
and thence information in discrete symbols. But if we want to build machines that 
interact intelligently with their surroundings and are not bound by the constraints of 
biological hardware, should we draw inspiration from biology, or can we do better by 
following the principles of Information and Decision Theory? 

The question of the existence of an "internal representation" is best framed within 
task the scope of a task, which provides a falsifiability mechanism. 5 As we already men- 

tioned, a task can be as narrow as a binary decision, or as general as "survival." In 
THE 4 R'S OF VISION the context of visual analysis I distinguish four broad classes of tasks, which I call the 

four "R 's " of vision: Reconstruction (building models of the geometry of the scene 
- shape), Rendering (building models of the photometry of the scene - material), 
Recognition (or, more in general, vision-based decisions such as detection, localiza- 
tion, recognition, categorization pertaining to semantics), and Regulation (or, more in 
general, vision-based control such as tracking, navigation, obstacle avoidance, manip- 
ulation etc.). 

For Reconstruction and Rendering, I am not aware of any principle that suggests 
an advantage in data analysis. It is not surprising that the current best approaches 
to infer the geometry and photometry of a scene from collections of images recover 
(piecewise) continuous surfaces and radiance functions directly from the data [90], 
unlike the traditional multi-step pipeline 6 that was long favored on complexity grounds 

of a radio signal one could represent it as a linear combination of a small number of (high-dimensional) 
bases, so few numbers (the coefficients) are sufficient to represent it in a parsimonious manner. This is 
different than breaking down the signal into pieces, e.g. partitioning its domain into subsets, as implicit in 
the process of encoding a visual signal through a population of neurons each with a finite receptive field. So, 
is there an evolutionary advantage in data analysis, beyond it being just a way to perform data compression? 
Another example, with motivations other than compression, is provided by the packet encoding of data for 
transmission in large networks such as the Internet. 

5 Of course one could construe the inference of the internal representation as the task itself, but this would 
be self -referential. 

6 A sequence of "steps" including point-feature selection, wide-baseline matching, epipolar geometry 
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(or pedagogical motivations, see [ ] and references therein). 

In this manuscript, we will explore the issue of representation from visual data 
for decision and control tasks. To avoid philosophical entanglements, we will not at- 
tempt to define "intelligent behavior" or even "knowledge," other than to postulate that 
knowledge - whatever it is - comes from data, but it is not data. This leads to the 
notion of the "useful portion" of the data, which one might call "information." So, OUr INFORMATION 
first step will be a definition of what "information" means in the context of performing 
a decision or action based on sensory data. 

As we will see, visual perception plays a key role in the signal- to- symbol barrier. 
As a result, much of this manuscript is about vision. Specifically, the need to perform 
decision and control tasks in a manner that is independent of nuisance factors includ- NUISANCE FACTORS 
ing scaling and occlusion phenomena that affect the image formation process leads to 
an internal representation that is intrinsically discrete, and yet lossless, in a sense to 
be made clear. However, for this to happen the perceptual agent (or, more in general, 
its evolved species) has to exercise control over certain aspects of the sensing process, control 
This inextricably ties sensing, information and control. A case-in-point is provided by 
Sea Squirts, or Tunicates, shown in Figure 1.1. These are organisms that possess a ner- 
vous system (ganglion cells) and the ability to move (they are predators), but eventually 
settle on a rock, become stationary and thence swallow their own brains. 7 Scaling and 
occlusion play a critical role: The first makes the continuum limit relevant, the second 
makes control a critical element in the analysis. These are present in a number of re- 
mote sensing modalities, including optical, infrared, multi- spectral imaging, as well as 
active ranging such as radar, lidar, time-of- flight, etc. 

1.1 How to read this manuscript 

This manuscript is designed to allow different levels of reading. Some of the material 
requires some background beyond calculus and linear algebra. To make the manuscript 
self-contained, basic elements of topology, variational methods and optimization, im- 
age processing, radiometry, etc. are provided in a series of appendices. These are 
color coded. The parts of the main text that require background in the corresponding 
discipline are coded with the same color. The reader can then either read through the 
colored text if he or she is familiar with that subject, disregarding the appendices, or 
use the appendix as a reference in case he or she is not familiar with the subject, or 
skip the colored text altogether. The manuscript is structured to allow getting the "big 
picture" without any mathematical formalism by just reading the black text. 

Summary for the experts (to be skipped by others) 

This section summarizes the content of the manuscript in a succinct manner. It can be 
used as a summary, or as a reference to the broader picture while reading the rest of 

estimation, motion estimation, triangulation, epipolar rectification, dense re-matching, surface triangulation, 
mesh polishing, texture mapping. 

7 This is sometimes used as a metaphor of tenure in academic institutions. 
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Figure 1.1: The Sea Squirt, or Tunicate, is an organism capable of mobility until it 
finds a suitable rock to cement itself in place. Once it becomes stationary, it digests its 
own cerebral ganglion cells, or "eats its own brain" and develops a thick covering, a 
"tunic" for self defense. 7 



the manuscript. For most readers, this summary will be cryptic or confusing at a first 
reading. If it was obvious, there would be no need for a manuscript. 

The long-term motivation for this study is three-fold. First, a desire to see a theory of information emerge in support 
of decision and control tasks, as opposed to transmission and storage of data. Then, a desire to enable the computation 
of performance bounds in a vision-based decision or control task, for instance bounds on the probability of error in visual 
recognition. Finally, the desire to give grounding to low-level vision operations such as segmentation, edge detection, feature 
selection etc. It goes without saying that these problems cannot be settled by one person in one manuscript, so this work 
aims to make a few steps in the direction of these long-term goals. 

What makes vision special, compared to other sensory modalities (auditory, tactile, olfactory etc.) is the fact that 
nuisance factors in the data formation process account for almost all the complexity of the data [ ]. Such factors include 
invertible nuisances such as contrast and viewpoint (away from occlusions), and non-invertible ones such as occlusions, 
quantization, noise, and general illumination changes. After discounting the effects of the nuisances in the data (invariance), 
even if one had started with infinite-resolution data, what is left is "thin" (supported on a zero-measure subset of the image 
domain). Its complexity measures the Actionable Information in the data. The fact that Actionable Information can be thin 
in the data is relevant in the context of the signal-to-symbol barrier problem. 

How can we deal with such nuisances? At decision time one can marginalize them (Bayes) or search for the ones that 
best explain the data (max-out, or maximum-likelihood). Some, however, may be eliminated in a process called canoniza- 
tion. While marginalization and max-out require solving complex integration or optimization problems at decision time, 
canonization can be pre-computed, and hence it enables straight comparison of statistics at decision time. It is preferable if 
time-complexity is factored in. However, this benefit comes with a predicament, in that canonization cannot decrease the 
expected risk, but at best leave it unchanged. Among the statistics that leave the risk unchanged (sufficient statistics), the 
ones that are also invariant to the nuisances would be the ideal candidates for a representation: They would contain all and 
only the functions of the data that matter to the task. 

Unfortunately, while for invertible nuisances one can construct complete features (invariant sufficient statistics), that 
act as a lossless representation (for decisions tasks), nuisances such as occlusion and quantization are not invertible. Thus, 
there is a gap between the maximal invariant and the minimal sufficient statistics. This gap cannot, in general, be filled by 
processing passively gathered data. 

However, when one can exercise some form of control on the sensing process, then some non-invertible nuisances 
can become invertible! Occlusions can be inverted by moving around the occluder. Scaling/quantization can be inverted 
by moving closer. Even the effects of noise can be countered by increasing temporal sampling and performing a suitable 
averaging operation. Therefore, in an active sensing scenario one can construct representations that are (asymptotically) 
lossless for decision and control tasks, and yet have low complexity relative to the volume of the raw data. This inextri- 
cably ties sensing and control. It also may enable achieving provable bounds, by generalizing Rate-Distortion theory to 
Perception-Control tradeoffs, whereby the "amount of control authority" over the sensing process (to be properly defined) 
how the theory unfolds trades off the expected error in a visual decision task. 
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In this manuscript, we characterize representations as complete invariant statistics, and call hallucination the simu- 
lation of the data formation process starting from a representation (as opposed to the actual scene). We define Actionable 
Information as the complexity of the maximal invariant statistics of the data, and Complete Information as the complexity 
of the (minimal sufficient statistic of a) complete representation. We define co-variant detectors, that enable the process of 
canonization, and their associated invariant descriptors. We define canonizability, and address the following questions: (i) 
When is a classifier based on an invariant descriptor optimal? (in the sense of minimizing the expected risk) (ii) what is 
the best possible descriptor? (iii) what nuisances are canonizable? (and therefore can be dealt with in pre-processing, as 
opposed to having to be marginalized or max-outed at decision time). 

Four concepts introduced in this study are key to the analysis and design of practical systems for performing visual 
decision and control tasks: Canonizability, Commutativity, Structural Stability, and Proper Sampling. 

Because of the presence of non-invertible nuisances, canonizability is not sufficient for optimality, as nuisances must 
also commute with one another. We show that the only nuisance that is canonizable and commutative is the isometric group 
of the plane. So, unlike common practice, affine transformations in general, and the scale group in particular, should not be 
canonized, but should instead be sampled and marginalized. 

Canonizing functionals, designed to select an element of the canonizable nuisance group, should be stable with respect 
to variations of the non-canonizable nuisances. We introduce the notion of Structural Stability, that is related to catastrophe 
theory and persistent topology. Selection by maximum structural stability margins gives rise to a novel feature selection 
scheme [1 16]. 

Whether the structure detected by a canonizing functional is "real" {i.e. it arises from phenomena in the scene) or 
"aliased" (i.e. it originates from artifacts of the image formation process, for instance quantization) depends on whether the 
signal is properly sampled. We introduce a notion of proper sampling that, unlike traditional (Nyquist-Shannon) sampling, 
cannot be decided based on a single datum (one image snapshot), but instead requires multiple images. This notion gives 
rise to a novel feature tracking scheme [116]. 

This also brings the issues of mobility and time front and center in both the theory, and in the implementation of 
effective recognition systems. It gives rise to a novel set of descriptors, such as the Best Template Descriptor and Time 
HOG, and to visual recognition schemes that are based on video [ ], as opposed to collections of isolated snapshots. 

Intra-class variability can then be captured by endowing the space of representations (which are discrete entities) 
with a probabilistic structure and learning distributions of individual objects or parts, clustered by labels. Objects are not 
necessarily rigid/static, but can also include "actions" or "events" that unravel in time. Time can be treated as yet another 
nuisance variable, which unfortunately is not invertible and therefore cannot be canonized without a loss. Time is, therefore, 
best dealt with by marginalization or max-out at decision time [156]. 

Along the way in our investigation we also discuss the role of "textures" and their dual ("structures"), and characterize 
them as the complement of canonizable regions [23]. 

The elements of this theory are presented following the trace below: 

1 . The starting point is the notion of visual decision leading to a definition of "visual information" as the part of the 
data that matters for the task. "Knowledge," "cognition," "understanding," "semantics," "meaning," are higher- 
order concepts that will will remain undefined and not tackled in this manuscript (Section 2.1). 

2. Visual decision problems (including detection, localization, recognition, categorization) are classification tasks 
based on data measured from images or video {/}. A class is identified by a label c, and specified by means of 
examples (e.g. a training set, Section 2.2). 

3. Optimal classification is based on a risk functional R. To compute a risk functional we need a forward (image- 
formation) model (or equivalently suitable assumptions and priors). We will use a formal image-formation model 
denoted by / = h(£,v), whereby the data / depend on the "scene" £ (the part of the data that matters) and 
nuisance factors v (Section 2.3). 

4. Nuisance factors, or nuisances, include viewpoint, contrast and other invertible nuisances. They also include invertible vs. non-invertible nuisances 
scaling/quantization, occlusions and other non-invertible nuisances. Finally, nuisances may also include intra- 

individual variability and other higher-level structure/priors. One can deal with nuisances via marginalization 
(Bayes), extremization (Maximum Likelihood, ML), or canonization (features, detector s/descriptors, Section 2.5). 

5. What is left in the data that does not depend on the nuisances represents the (actionable) "information" present in 
the data (Section 2.6). 

6. While Bayes/ML are optimal (they minimize the risk, under different priors), features are not, in general (Section 
2.7). 

7. The use of features brings about the notion of sufficient statistics, the "data processing inequality" (or Rao and 
Blackwell's theorem), the notions of invariance and representation, and the interplay between the representation 
and the classifier. This also raises three crucial questions: (a) When can features be used without a loss? (b) When 
there is a loss, can it be quantified and bounded? (c) Even classifiers that are optimal (Bayes, ML minimize risk) 

do not achieve zero risk. Are there bounds on the risk? (Chapter 3). no meaningful bounds for passive visual recogni- 

tion 

8 The characterization of images as "controlled hallucinations" was first introduced by J. Koenderink. 
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COMMUTATIVITY 
STRUCTURAL STABILITY 
PROPER SAMPLING 
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ACTIVE RECOGNITION BRINGS 
RECOGNITION BOUNDS" 



INFORMATION CONTROL 



8. A somewhat obvious, but nevertheless key observation is that for "passive" visual recognition, it is not possible to 
arrive at "meaningful" bounds on the risk. Meaningful in this context means that the error bound in the task of 
determining the class of a certain "object" does not depend on the specific instance of the object and the nuisances. 

ABOUT "CONTROLLED (Section 10.1). 

9. A second key observation is that for "active" visual recognition, the risk can be made arbitrarily small (asymptot- 
ically) [ ] . This brings about the role of time, and the asymmetric nature of training data (that require time) and 
test data (testing can be instantaneous). It also has some epistemological ramifications, and links the concept of 
"information" to "control" or "action." (Section 10.2). 



CONTROL-RECOGNITION THEORY 



10. For active recognition, the currency that trades off recognition performance (risk) is the control authority one can 
exercise over the sensing process (exploration). (Section 10.3). 



11. A notion of "representation" arises naturally in active recognition, and its complexity quantifies the "complete 
information." (Sections 2.5.4 and 3.1). 



STRUCTURAL STABILITY OF THE REPRESENTATION 
TEXTURE, SEGMENTATION 



Computing, or inferring, such a representation from the data entails the notion of invariance (for invertible nui- 
sances) and stability (for non-invertible nuisances). It involves common early-vision operations, such as texture 
analysis and segmentation. (Chapter 4). 



OCCLUSIONS, MOTION 



Inferring a representation highlights the critical role of time (and motion within physical space) in the represen- 
tation, and also the critical role of occlusions, without which a representation would not be justified on decision- 
theoretic grounds. (Chapters 6 and 8). 



14. For visual decisions regarding "objects" that have a temporal component (a.k.a. "events," or "actions," or "activi- 
ties"), time can be thought of as a nuisance factor and marginalized, max-outed, or canonized. (Chapter 7). 



1.2 Precursors 

The design and computation of visual representations for recognition has a long history, 
dating back at least to the Seventies and before (see [126, 180] and references therein). 
While Marr's representation using zero-crossings of differential operators was discred- 
ited because of instability in the reconstruction process (i.e. obtaining images back 
from their representations), reconstructing data is not the main purpose. Using the rep- 
resentation to support decision and control tasks is. Many have attempted to design 
representations that are tailored to recognition (as opposed to image reconstruction) 
tasks, some using similar ideas of extrema of scale-spaces constructed from differen- 
tial operators [123, 119]. However, most of these designs have been performed in an 
ad-hoc manner, guided by intuition, common sense, and some biological inspiration. 
As we have commented above, standard statistical decision theory does not offer suit- 
able principles, as it would call for the direct design of "super-classifiers" forgoing 
intermediate representations altogether, unless they are directly tied to the task. 

Recent work [ ] aims to frame the construction of invariant/sufficient representa- 
tions in the context of Active Vision, formalizing some ideas of J. J. Gibson [ ] . This 
approach, however, falls short on several counts. First, invariance is too much to ask. 
In the process of being invariant to general viewpoints, shape becomes indiscriminative 
[197]. And yet, we know we can discriminate objects that are deformed versions of the 
same material. Second, often priors are available, from training or otherwise, both on 
the nuisance and on the class, and such priors should be used. There is no point in 
requiring invariance to illuminations that will never be; better instead to be insensitive 
to common nuisances, and relax the representation where nuisances have low probabil- 
ity even though this opens the possibility of illusions for unlikely nuisance and scene 
combinations (Fig. 1.2). Third, even though the Active Vision paradigm is appealing, 
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Figure 1.2: In the absence of sufficiently informative data (e.g. one image), priors en- 
able classification that, occasionally, can be incorrect if the concomitance of nuisances 
and scene concur to a visual illusion. From non- accidental vantage point, the scene on 
the top-left looks like a girl picking up a ball (image courtesy of Preventable), rather 
than a flat painting on the road surface. Similarly, the purposeful collection of ob- 
jects on the bottom looks like meaningful symbols when seen from a carefully selected 
vantage point. 

in practice we often do not have control on the sensing platform. 

Therefore, there remains the need to properly treat nuisances that are non-invertible, 
such as occlusions, quantization and noise, in a passive sensing scenario, and to be 
able to exploit priors when available. 

The recent literature is studded with different approaches for low-level pre-processing 
of images for visual classification. These include various feature detectors and descrip- 
tors, too many to cite extensively, but the most common being [123, 135]. These are 
compared empirically on end-to-end tasks such as wide-baseline matching [ ], cat- 
egorization [1 17], or category localization and segmentation [ ] tasks. However, an 
empirical evaluation tells us which scheme performs better, but gives us no indication 
as to the relation between different schemes, no hint on how to improve them, and no 
bounds on the best achievable performance that can be extrapolated to other datasets 
with provable guarantees. 

Therefore, there remains the need to develop a framework for the analysis and 
design of feature detectors/descriptors, that allows rational comparison of existing de- 
scriptors, and engineering design of new ones, and understanding of the conditions 
under which they can be expected to perform. 

There is a sizable literature on the detection and computation of structures in im- 
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ages. In particular, [181] derive detectors based on a series of axioms and postu- 
lates. However, while these explain how such low-level representations should be 
constructed, they give no indication as to why they would be needed in the first place. 
Much of the motivation in this literature stems from biology, and in particular the struc- 
ture of early stages of processing in the primate visual system. In this manuscript, we 
take a different approach. Rather than trying to explain biology, we ask the question of 
what is the best representation to support recognition tasks, regardless of the architec- 
ture or hardware limitations. It will be interesting, afterwards, to investigate whether 
the primate visual system behaves in a way that is compatible with the principles de- 
veloped here. 

By its nature, this manuscript relates to a vast body of literature in low-level vi- 
sion and also Active Vision [2, 14]. Ideally it relates to every paper, by providing a 
framework where different approaches can be understood and compared. However, it 
is possible that some approaches may not fit into this framework. I hope that this work 
provides a seed that others can grow or amend. 

This manuscript also lends some analytical support for the notion of embodied 
cognition that has been championed by cognitive roboticists and philosophers including 
[30,112,194,148,131,14]. 

Finally, after presenting the material in these notes in a NIPS tutorial in 2010, 
the author became aware of the work of N. Tishby, who has been addressing similar 
questions using an information-theoretic questions, and work is underway to combine 
and reconcile the two approaches. 



Chapter 2 

Visual Decisions 



Visual classification tasks - including detection, localization, categorization, and recog- 
nition of general object classes in images and video - are challenging because of large 
in-class variability. For instance, the class "chair", defined as "something you can sit 
on" (presumably man-made), comprises a diversity of shapes, sizes and materials that 
exhibit a wide variety of correlates in the images (Figure 2.1). Even if one sets aside 






similar shape? 





similar function? 

Figure 2.1: Category recognition: Given images {ii, . . . , I n } of scenes that belong 
to a common class (left), does a new image I (right) portray a scene that belongs to the 
same class? 



within-class variability and considers the detection, localization or recognition of an in- 
dividual object (e.g. this chair) from a field of alternate hypotheses (e.g. other chairs), 
the data still exhibits large variability due to nuisance factors such as viewpoint, illu- nuisance 
mination, occlusion, etc., that have little to do with the identity of the object (Figure 
2.2). 

It is tempting to hope that a "super-classifier" fed with raw data could be trained 
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to, somehow, discard the variability in images due to nuisance factors, and reliably 
recognize object classes in pictures. The paper [ ] dampens such hopes by showing 
that the volume of the quotient of the set of images modulo changes of viewpoint and 
contrast is infinitesimal relative to the volume of the data. This means that a hypo- 
thetical "super-classifier" fed with raw images would spend almost all of its resources 
learning the effects of nuisance factors, rather than the identity of objects or their cat- 
egory. This would hold a-fortiori once complex illumination phenomena, occlusions, 
and quantization - all neglected in [176] - were factored in. 1 

It is equally tempting to hope that one could pre-process the data to obtain some 
"features," that do not depend on these nuisances, and yet retain all the "information" 
present in the data. Indeed, [ ] suggests a construction of such features that, how- 
ever, requires nuisances to have the structure of a group and hence breaks down in the 
presence of complex illumination effects, occlusions and quantization. One could re- 
lax such strict "invariance" requirement to some sort of "insensitivity" but, in general, 
pre-processing can only reduce the performance of any classifier downstream [165], 
which undermines the notion of "vision-as-pre-processing 2 for general-purpose ma- 
chine learning." Why should one perform 3 segmentation, edge detection, feature selec- 
tion and other generic low-level vision (pre-processing) operations, if the performance 
of whatever classifier using them decreases! 

This goes to the heart of a notion of "information." Ideally, the purpose of vision 
would be to "extract information" from images, where "information" intuitively re- 
lates to whatever portion of the data "matters" in some sense. Traditional Information 
Theory has been developed in the context data transmission, where one wants to re- 
produce as faithful as possible a copy of the data emitted by the source, after it has 
been corrupted by the channel. Thus the goal is reproduction of the data, with minimal 
distortion, and the "representation" simply consists of a compressed encoding of the 
data that exploits statistical regularity. In this context, every bit counts, and the seman- 
tic aspect of information is indeed irrelevant, as Shannon famously wrote. The theory 
yields a tradeoff between the minimum size of the representation as a function of the 
maximum amount of distortion. This tradeoff is computed explicitly for very simple 
cases (e.g. the memory less Gaussian channel), but nevertheless the general formulation 
of the problem is one of Shannon's most significant achievements. 

In our context, the data (images) are to be used for decision purposes (detection, 
localization, recognition, categorization). The goal is to minimize risk. In this context, 
following [177], there may be conditions where most of the data is useless, and the 
semantic aspect is fundamental, for the sufficient statistics can be discrete (symbols) 
even when the data lives in the continuum. 

Several have advocated the development of a theory, mirroring Shannon's Rate- 
Distortion Theory, to describe the minimum requirements (in terms of size of the rep- 

l T. Poggio recently put forward the hypothesis that this is true also in biology, in the sense that the 
complexity of the primate visual system is mainly to deal with nuisances, rather than to capture the intrinsic 
variability of the objects of interest [151]. 

2 For pre-processing to be sound, some kind of "separation principle" should hold, so that different mod- 
ules of a visual inference system could be designed and engineered independently, knowing that their com- 
position or interconnection would yield sensible end-to-end performance. 

3 As we have already pointed out in the preamble, computational efficiency alone does not justify the 
discretization process. 
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resentation, computational or other "cost") in order to have a recognition error that is 
bounded above. Ideally, like in Shannon's case, this bound could be made arbitrar- 
ily small by paying a high-enough price. Despite many efforts, no theory has emerged, 
only special cases restricted to imaging modalities where some of the crucial aspects of 
image formation (scaling and occlusions) are not manifest. This may not be by chance 
because, in the context of visual recognition, the worst-case scenario is an arbitrarily 
high error rate. This is not surprising, and indeed can be considered trivial. What may 
be a bit more surprising is that even the average-case scenario can be arbitrarily bad, 
as we will argue in Section 8.4.1. This may also shed some light on the limitations of 
benchmark datasets, if the performance of a given algorithm is interpreted as represen- 
tative of performance on other datasets. The result of the benchmarking are meaningful 
to the extent in which the dataset is representative of the scenarios one wishes to cap- 
ture, but no guarantee can be made on the generalization properties of these methods. 
Again, scale, quantization and occlusion conjure towards the failure of any "passive" 
recognition scheme to provide generalization bounds. What is surprising is that, if the 
data acquisition process can be controlled, then both the worst-case and average-case 
error can be bounded, and indeed they can be made (asymptotically) arbitrarily small 
(Section 8.4.3). Analogously to Shannon's Rate-Distortion theory, there is a tradeoff 
between the "control authority" one can exercise over the sensing process, and the 
performance in a decision task (Section 8.4.4). 

In the next section, we begin the formalization process necessary to answer some 
of the questions raised so far. 

2.1 Visual decisions as classification tasks 

By "visual decision" we mean any task of detection, localization, categorization and 
recognition of objects in images or video. These are all classification problems, where objects 
in some cases the class is a singleton (recognition), in other cases it can be quite general 
depending on functional or semantic properties of objects (Figure 2.1). Conceptually, 
they all require the evaluation and learning of the likelihood of the data (one or more 
images I) given the class label c: p(I\c). To simplify the narrative, we consider binary 
classifiers c G {0,1} with equal prior probability P(c) = \. Generalizations are 
conceptually, although not computationally, straightforward. 

A decision rule, or a classifier, is a function c : X — )> {0, 1} mapping the set of 
images onto labels. It is designed to keep the average loss from incorrect decisions 
small. A loss function A : {0, l} 2 — » R + ;(c, c) h->> A(c, c) maps two labels to a 
positive real value. We will consider, for simplicity, the symmetric — 1 loss, where 
A(c, c) = 1 — S(c, c), where S is Kronecker's delta, that is 1 if the labels are the same, 
if they are different. The average loss, a.k.a. conditional risk, is given by risk 



It can be shown [55] that the decision rule that minimizes the conditional risk, that is 




(2.1) 



c = c(J) = arg min R(I, c) 



(2.2) 
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is optimal in the sense that it minimizes the expected (Bayesian) risk 

R(c) = [ R(I,c)dP(I). 



(2.3) 



That is, if one could actually compute this quantity, which depends on the availability 
of the probability measure dP(I), which is tricky to even define, let alone learn and 
compute [138]. However, in the context of our investigation this is irrelevant: Whatever 
mathematical object dP{I) is, we can easily sample from it by simply capturing images 
I ~ dP(I), as we will see in Section 8. We will see in Section 3.1 that, even to generate 
"simulated images," we do not need access to the "true" distribution dP(I), but rather 
to a representation, which we will define in Section 3 and discuss in Section 3.1. 

Under the assumptions made, minimizing the conditional risk is equivalent to max- 
imizing the posterior p(c\I), which in turn (under equiprobable priors P(c = 1) = 
P(c = 0) = 1/2) is equivalent to maximizing the likelihood p(I\c): 



So, in a sense, the problem of visual decision-making, including detection, localization, 
recognition, categorization, is encapsulated in (2.4). That would be easy enough to 
solve if we could actually compute the likelihood. 

The difficulty in visual decision problems arises from the fact that the image I de- 
pends on a number of nuisance factors that do not depend on the class, and yet they 
affect the data. What is a nuisance depends on the task, and may include viewpoint, 
illumination, partial occlusions, quantization etc. (Figure 2.2). If we could, we would 
base our decision not on the data /, but on hidden variables £ that comprise the defining 
characteristics of the scene (object, category, location, event, activity etc.) that depend 
on the class c, through a Markov chain c — >• £ —> I. This would correspond to a data 
generation model whereby a sample c is selected from P(c), based on which a sample 
£ is selected from dQ c = dP(£\c), from which a measurement I is finally sampled via 
an image-formation functional I = h(£). 

However, because of the nuisances, we have to instead consider a generative model 
of the form / = h(£,v), where h is a functional that depends on the imaging device 
and v are all the nuisance factors. It is convenient to isolate within the nuisance v the 
additive noise component n arising from the compound effects of un-modeled uncer- 
tainty, although there is no added generality as n can be subsumed in the definition of 
v. It is also useful to isolate the nuisances that act as a group on the scene, g, although 
again we could lump them into the definition of v. If we model explicitly the group 
and the noise, we have a model of the form 



This is the formal model that we will adopt throughout the manuscript (Figure 2.2). 
In the next section we make this formal notation a bit more precise with a specific 
instantiation, the so-called Ambient-Lambert model. More realistic instantiations are 
described in Appendix B.l. The reader interested in generalizations of the simple sym- 
metric binary decision case can consult any number of textbooks, for instance [55]. 



c = arg max p(I\c). 

cG{0,l} V 



(2.4) 



(2.5) 
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v = viewpoint / = z>), £ 7^ £ 



Figure 2.2: The same scene £ can yield many different images depending on particular 
instantiations of the nuisance v. 

2.2 Image formation: The image, the scene, and the 
nuisance 

In this section, that can be skipped at first reading, we instantiate the formal notation 
(2.5) for a simple model used throughout the manuscript. All the symbols used, to- 
gether with their meaning, are summarized for later reference in Appendix D in the 
order in which they appear. This section is necessary to make the formal notation 
above meaningful. However, its content will actually not be used until Sections 3.1, 
3.4, and will be exploited in full only starting in Section 4. Therefore, the reader can 
skip this section at first reading, and come back to it, or to Appendix D, as needed. The 
model we introduce in this section is the simplest instantiation of (2.5) that is mean- 
ingful in the context of image analysis. More sophisticated models, and their relation 
to the simplest one introduced here, are described in Appendix B.l. 

An image / : D C M? —> R+; x \-> I{x) is a positive- valued map defined on a 
planar domain D; we will focus on gray-scale images, k = 1, but extensions to color 
k = 3 or multiple spectral bands can be made. The scene, indicated formally by £ = 
{5, p}, is described by its shape S and reflectance (albedo) p. The shape component 
S C M 3 is described by piecewise smooth surfaces. The surface S does not need to 
be simply connected, and can instead be made of multiple connected components, Si, 
with UiSi = S, each representing an object. The reflectance p is a function defined On OBJECT 
such surfaces, with values in the same space of the range of the image, p : S R k . 
In the presence of an explicit illumination model, p denotes the diffuse albedo of the 
surface. In the absence of an illumination model, one can think of the surfaces as 
"self-luminous" and p(p) denotes the energy emitted by an infinitesimal area element 
at the point p isotropically in all directions. We call the space of scenes E. Deviations 
from diffuse reflectance (inter-reflection, sub-surface scattering, specular reflection, 
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Figure 2.3: Image-formation model: The scene is described by objects whose geometry 
is represented by S and photometry (reflectance) p. The image, taken from a camera 
moving with g, is obtained by a central projection onto a planar domain D, and a 
contrast transformation. 



cast shadows) will not be modeled explicitly and lumped as additive errors n. 

Nuisance factors in the image-formation process are divided into two components, 
{g,v}, one that has the structure of a group g G G, and a component that is not a group, 
e.g. quantization, occlusions, cast shadows, sensor noise etc. We denote the image- 
formation model formally with a functional h, so that / = h(g : £, v) + n as in (2.5). 
This highlights the role of group nuisances g, the scene £, non-invertible nuisances, 
and the additive residual that lumps together all unmodeled phenomena including noise 
and quantization (non-additive noise phenomena can be subsumed in the non-invertible 
nuisance v). We often refer to the group nuisances as invertible and the non-group 
nuisances as non-invertible. 

The simplest instantiation of this model is the so-called Lambert- Ambient- Static 
(LAS) model, that approximates a static Lambertian scene seen under constant diffuse 
illumination, describing reflectance via a diffuse albedo function and thus neglecting 
complex reflectance phenomena (specularity, sub-surface scattering), and describing 
changes of illumination as a global contrast transformation and thus neglecting com- 
plex effects such as vignetting, cast shadows, inter-reflections etc. Under these as- 
sumptions, the radiance p emitted by an area element around a visible point p G S is 
modulated by a contrast transformation k (a monotonic continuous transformation of 
the range of the image) to give the irradiance I measured at a pixel element x, except 
for a discrepancy n : D — > . The correspondence between the point p G S and the 
pixel x G D is due to the motion of the viewer g G SE(3), the special Euclidean group 
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of rotations and translations in three-dimensional (3-D) space: 



I(x) = k o p(p) + n{x)\ 
x = 7r(gp); peS 



(2.6) 



where if p is represented by a vector X G R 3 , then ir : R 3 — >> R 2 ; X x = lambert-amb: 
[X1/X3, X2/X3P is a central perspective projection (Figure 2.4). This equation 
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Figure 2.4: Perspective projection: x is a point on the image plane, p is its pre-image, 
a point in space. Vice-versa, ifp is a point in space, x are the coordinates of its image, 
under perspective projection. 

is not satisfied for all x G D, but only for those that are projection of points on the 
object, x G D D n{gS). Also, not all points p G S are visible, but only those that 
intersect the projection ray closest to the optical center. If we call p\ the intersection 
of S with the projection ray through the origin of the camera reference frame and the 
pixel with coordinates x G 
graph of a scalar function Z : 
x G P 2 : 



represented by the vector x, we can write p\ as the 
Z(x), the depth map . Here a bar 



— > 



; x H> 



Euclidean coordinates x G 



2 denotes the homogeneous (projective) coordinates of the point with 

x 



P2. 



X 



1 



More in general, the pre-image of the 



point x (on the image plane) is a collection of points (in space) given by 



^s 1 ^) = {Pit--,Pn e S} = {g 1 xZ i (x)} 



N(x) 

1 • 



(2.7) 



Note that the number of points in the pre-image depends on x, and is indicated by N(x) . 
The pixel locations where two pre-images coincide are the occluding boundaries, also 
known as silhouettes. For instance, {x \ Z\{x) = Zj(x)} for some j = 2, . . . , N 
is an occluding boundary. If we sort the points in order of increasing depth, so that 
Z\[x) < Z<i{x) < • • • < Zn(x), then the pre-image, restricted to the point of first 
intersection, is indicated by 



Pi 



- x xZ(x) 



(2.8) 
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where we indicate Z\{x) simply with Z(x). We omit the subscript S in the pre-image 
for simplicity of notation, although we emphasize that it depends on the geometry of 
the surface S. We also omit the temporal index t, although we emphasize that, even 
when the scene S is static (does not depend on time), changes of viewpoint g t will 
induce changes of the range map Z t (x). We also omit the contrast transformation k, 
that can also change over time, although we will re-consider all these choices later. 

Note that the image only exists for x G D D ir(gS), and the pre-image only exists 
for p G S D tt~ 1 (D). For the region of the image where the scene is visible, say at 
time t = 0, we can represent S as the graph of a function defined on the domain of the 
image Io, so p = 7r _1 (xq) = xoZ(xo). Here k can be taken to be the identity function, 
fco = Id, so that Iq(xq) = p(p) p = tt~ 1 (xo). Therefore, combining this equation 
with (2.6), we have Io(xq) = k o p o tt~ 1 (xo), with the geometric component being 

x = Trgp = 7r#7r -1 (£o) = w(x ) (2.9) 

where w : R 2 — >> R 2 denotes the domain deformation induced by a change of viewpoint 
g, and the photometric component being 



/ o w(xo) = k o Io(xo) (2.10) 

BRIGHTNESS CONSTANCY CONSTRAINT that is know as the "brightness constancy constraint equation." Note that both the two 

previous equations are only valid in 

x G D D w(D) = DDTrgir^iD) (2.11) 



which is called the co-visible region. It can be shown that the composition of maps 
w : R 2 —> R 2 ; xq ^ x = w(xo) = 7r(gxoZ(xo)) spans the entire group of diffeo- 
morphisms (Theorem 2 of [177]). In this model, visibility is captured by the map 
7r and its inverse. In particular, 7r : R 3 — )> R 2 maps any points in space p into one 
location x = ir(p) on the image plane, assumed to be infinite. So the image of a point 
p is unique. However, the pre-image of the image location x is not unique, as there are 
infinitely many points that project onto it. If we assume that the world is populated by 
opaque objects, then the pre-image tt^ 1 (x) consists of all the points on the surface(s) S 
that intersect the projection ray from the origin of the camera reference frame through 
the given image plane location x. This model does not include quantization, noise and 
other phenomena that are discussed in more detail in the appendix. 



2.3 Marginalization, extremization 

If we have prior knowledge on all the hidden variables, n, we can compute the 

likelihood in (2.4) by marginalization. This is conceptually trivial, but computationally 
prohibitive. Prior knowledge on the nuisance is encoded in a distribution dP(y), which 
may have a density p{y) with respect to a base measure d[i{y). Prior knowledge on 
the scene is encoded in the class-conditional distribution dQ c (£), which may again 
have a density q(£\c) with respect to a base measure dfi(^). The same goes with the 
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group g. 4 We can reasonably 5 assume that the residual, after all relevant aspects 
of the problem have been explicitly modeled, is a white zero-mean Gaussian noise 
n ~ A/*(0; E) with covariance E. Marginalization then consists of the computation of 
the following integral 

p(I\c) = f M(I - %, £, v)\ X)dP(g)dP(v)dQ c (Z). (2.12) 

The problem with this approach is not just that this integral is difficult to compute. 
Indeed, even for the simplest instantiation of the image formation model (2.6), it is not 
clear how to even define a base measure on the sets of scenes £ and nuisances v, let 
alone putting a probability on them, and learning a prior model. It would be tempting 
to discretize the model (2.6) to make everything finite-dimensional; unfortunately, be- 
cause of scaling and quantization phenomena in image formation, these models would 
have very limited practical use. This integral is costly to compute even for the simplest 
characterizations of the scene and the nuisances. Indeed, the space of scenes (shape, ra- 
diance distribution functions) does not admit a "natural" or "sufficient" discretization. 
However, if one was able to do so, then a threshold on the likelihood ratio, depending 
on the priors of each class, yields the optimal (Bayes) classifier [159]. 

An alternative to marginalization, where all possible values of the nuisance are 
considered with a weight proportional to their prior density, is to find the class together 
with the value of all the hidden variables that maximize the likelihood: 

p(I\c) = sup Af (I - h(g,tv))p(g)p(v)q(ti\c)- (2.13) 

This procedure of eliminating nuisances by solving an optimization problem is called 
"extremization" or sometimes "max-out" of the hidden variables. A threshold on the 
result yields the maximum likelihood (ML) classifier. Needless to say, this is also a 
complex procedure. This procedure also relates to registration or alignment between 
(test) data and a "template," a common practice in image analysis. 

The important aspect of both marginalization and max-out procedures is that they 
depend on both the test data and the training data, encoded in the class-conditional 
density, hence they have to be computed at decision-time and cannot in general be pre- decision-time 
computed. The next section describes an alternative, further elaborated in Section 3. 
Note that the max-out procedure can be understood as a special case of marginaliza- 
tion when the prior is uniform (possibly improper). Therefore, we will use the term 
marginalization to refer to either procedure, depending on whether or not a prior is 
available. 



4 One may encode complete ignorance of some of the group parameters by allowing a subgroup of G to 
have a uniform prior, a.k.a. uninformative, and possibly improper, i.e. not integrating to one, if the subgroup 
is not compact. 

5 If this is not the case, that is if the residual exhibits significant spatial or temporal structure, such structure 
can be explicitly modeled, thus leaving the residual white. 
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2.4 Features 

A feature is any deterministic function of the data, (f> : X — >• R x ; J ^ (f)(1), or some- 
times o J. In general, it maps onto a finite-dimensional vector space R K , although 
in some cases the feature could take values in a function space. Obviously, there are 
many kinds of features, so we are interested in those that are "useful" in some sense. 
The decision rule itself, c(I) is a feature. However, it does not just depend on the datum 
/, but also on the entire training set. Therefore, we reserve the nomenclature "feature" 
only for deterministic functions of the current data (sample), but not on the training set 
(ensemble). Deterministic functions of (ensemble) data are called "statistics." We call 
any statistic of the training set (but not of the test datum) a template. For instance, the 
class-conditional mean is a template. One could build a classifier as a combination of 
a feature and a template, but in general the classifier can operate freely on training and 
test data, and does not necessarily compute statistics on each set independently. 

One can think of a feature as any kind of "pre-processing" of the data. The ques- 
tion as to whether such pre-processing is useful is addressed by the data processing 
inequality. 

2.4.1 Data processing inequality 

Let R(I,c) be the conditional risk (2.1) associated with a decision c, and c : X —> 
{0, 1}; I i— » c(I) be the optimal classifier, defined as c = argmin c R(I, c). Let <j) : 
X —> R K be any feature. Then, if R(c) is the Bayes risk (2.3) associated with the 
classifier c, we have that 

min R(c) < min R(c o <p). (2.14) 

c c 

In other words, there is no benefit in pre-processing the data. This results follows 
simply from the Markov chain dependency c —t I (f>, and is a consequence of Rao 
and Blackwell's theorem ([165], page 88), known as "data processing inequality" ([44] 
Theorem 2.8.1, page 32, and the following corollary on page 33). Thus, it seems that 
the best one can do is to forgo pre-processing and just use the raw data. Even if the 
purpose of a feature (f)(1) is to reduce the complexity of the problem, this is in general 
done at a loss, because one could include a complexity measure in the risk functional 
R, and still be bound by (2.14). However, there are some statistics that are "useful" in 
a sense that we now discuss. 



2.4.2 Sufficient statistics 

Those statistics that maintain the "=" sign in the data processing inequality (2.14) are 
called sufficient statistics for the purpose of this manuscript. 6 Thus, (f) : X — >> R K is a 
sufficient statistic if min c R(c) = ming R(co<p). Of course, what is a sufficient statistic 
depends on the task, encoded in the risk functional R. A trivial sufficient statistic is 
the identity functional (f)(1) = I. Of all sufficient statistics, we are interested in the 
"smallest," i.e. , the one that is a function of all other sufficient statistics. This is called 
the minimal sufficient statistic, which we indicate with (f> y (I). 



6 This is a more loose notion than the independence condition in Neyman's factorization criterion. 
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Clearly, sufficiency and minimality represent useful properties of a statistic, and for 
this reason they deserve a name. A minimal sufficient statistic contains everything in 
the data that matters for the decision; no more (minimality) and no less (sufficiency). 
Another useful property of a feature is that it contains nothing that depends on the 
nuisances. 

2.4.3 Invariance 

In the image-formation model (2.5), we have isolated nuisances that have the structure 
of a group g, those that are not, v, and then the additive "noise" n. The latter is a 
very complex process that has a very simple statistical description (e.g. a Gaussian 
independent and identically-distributed process in space and time). Instead, we focus 
on the group nuisances, g G G and on the non-invertible 1 nuisances v. A statistic NON-INVERTIBLE NUISANCES 
is G-invariant (or ^-invariant) if it does not depend on the nuisance: (f)(1) = (f) o 
= ° v ) f° r an Y two nuisances g, g and for any v and £ G S (or 

0(1) = o h(g, £, i/) = o £, z>) for any two nuisances z/, z> and for any g G G 
and £ G S). One could similarly define an invariant feature for both the group and the 
non-invertible nuisances, although we will not end up using those. 

Any constant function is an invariant feature, (f)(1) — const, V /; obviously it is not 
very useful. Of all invariant features, we are interested in the "largest," in the sense that 
all other invariants are functions of it. We call this the maximal invariant, and indicate 
it with the symbol 4>q(I) or (j) A (I) when the group G is clear from the context. MAXIMAL INVARIANT 

In general, there is no guarantee that an invariant feature, even the maximal one, be 
sufficient. In the process of removing the effects of the nuisances from the data, one 
may lose discriminative power, quantified by an increase in the expected risk. Vice- 
versa, there is no guarantee that a sufficient statistic, even the minimal, be invariant. 
In the best of all worlds, one could have a minimal sufficient statistic that is also in- 
variant, or vice-versa that a maximal invariant that is sufficient. In this case, the 
feature (f) w (I) = (f) A (I) would be the best form of pre-processing one could hope for: 
It contains all and only the "information" the data contains about £, and have no de- 
pendency on the nuisance. We call a minimal sufficient invariant statistic a complete 
feature. Note that, in general, a complete feature is still not equivalent to £ itself, for COMPLETE FEATURE 
the map £ — » (f) may not be injective (one-to-one). However, it can be shown that when 
the nuisance has the structure of a group, one can define an invariant statistical model 
(Definition 7.1, page 268 of [ ]) and design a classifier, called equi-variant, that EQUI- VARIANT CLASSIFIER 
achieves the minimum (Bayesian) risk (see Theorem 7.4, page 269 of [159]). This can 
be done even in the absence of a prior, assuming a uniform ("un-informative" and pos- 
sibly improper) prior. For this reason, we will be focusing on the design of invariants 
to the group component of the nuisance, an refer to G-invariants as simply invariants. 

Therefore, if all the nuisances had the structure of a group G, i.e. when v = 0, the 
maximal invariant would also be a sufficient statistic, and this would be a very fortunate 
circumstance. Unfortunately, in vision this does not usually happen. That is, unless we 
do something about it (as we explore in Section 8). 



7 The nomenclature "non-invertible" for nuisances that are not group stems from the fact that the crucial 
property of group nuisances that we will exploit is their invertibility. 
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2.4.4 Representation 

Given a scene £, we call £(£) the set of all possible images that can be generated by 
that scene up to an uninformative 8 residual: 

C(Z)±{I~h(g,£,v), geG.ueV}. (2.15) 

We have omitted the additive noise term since, in general, it describes the compound 
effect of multiple factors that we do not model explicitly, and therefore it is only de- 
scribed in terms of its ensemble properties from which it can be easily sampled. The ~ 
sign above indicates that the scene image is determined up to the additive residual n, 
which is assumed to be spatially and temporally white, identically distributed with an 
isotropic density (homoscedastic). It is, by definition, uninformative. 

Given an image I, in general there are infinitely many scenes £ that could have 
generated it under some unknown nuisances, so that 

/ e £(£). (2.16) 

REPRESENTATION We call any such scene a representation. A representation is a feature, i.e. , a function 

of the data: It is the pre-image (under C) of the measured image I: £ G and 
takes values in the space of all possible scenes S. 

A trivial example 9 of a representation is the image itself glued onto a planar sur- 
face, that is £ = (5, p) with S = D C M? and p = I. Depending on how we define 
the nuisance v in relation to the additive noise n, the "true" scene may not actually be 
a viable representation, which has subtle philosophical implications. More in general, 
there is no requirement at this point that the representation £ be unique, or have any- 
thing to do with the "true" scene £, whatever that is. In Chapter 8 we will study ways 
to design sequences of representations that converge to an approximation of the "true" 
scene £. 

While this concept of representation is pointless for a single image, it is important 
when considering multiple images of the same scene. In this case, the requirement is 
that the single representation £ simultaneously "explains" an entire set of images {/}: 

ieC-\{I})eE. (2.17) 

Of all representations, we will be interested in either the "most probable," if we are 
lucky enough to have a prior on the set of scenes dQ(£), which is rare, or in the sim- 
plest one, which we call the minimal representation (a minimal sufficient statistic), and 
indicate with £ v . 

Given a scene £, we are particularly interested in the minimal representation that 
can generate all possible images that the original scene £ can generate. We call this a 

COMPLETE REPRESENTATION complete representation: 

8 For instance, a spatially and temporally independent homoscedastic noise process. 

9 Another example can be constructed from any partition of the image domain (a partition of the domain 
D is a collection of sets that are disjoint Qi n Qj = Sij and whose union equals D = U^f^) by 

assigning to each region Qi an arbitrary depth Zi, and constructing a piece- wise planar surface S, on which 
to back-project the image via p(p) = I(x) for all p = xZ{(x), x £ Qi. 
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i is complete iff £(|) = £(f ). (2.18) 

When clear from the context, we will omit the superscript v , and refer to a minimal 
complete representation as simply the representation. The symbol C for the set of 
images that are generated by a representation is chosen because, as we will see, a 
complete representation is related to the light field of the underlying scene. Indeed, light held 
with any representation £ one could synthesize, or hallucinate, infinitely many images HALLUCINATION 
via the image-formation model (2.5) 

I~ %,£>), Vs,za (2.19) 

In other words, a representation is a scene from which the given data can be hallu- 
cinated. We will elaborate the issue of hallucination in Section 3.1, where we will 
describe the relation to the light field. For the purpose of a visual decision task, a COm~ LIGHT FIELD 
plete representation is as close as we can get to reality (the scene £) starting from the 
data. 

2.5 Actionable Information and Complete Information 

The complexity of the data, measured in various ways, e.g. via coding length or algo- 
rithmic complexity [118], has traditionally been called "information" in the context of 
data compression and transmission. If we think of an image as a distribution of pixels, 
then the entropy of this distribution is sometimes used as a measure of its complexity, 
or "information" content [ ]. Although the complexity of an image may be relevant 
to transmission and storage tasks (the most costly signal to transmit and store is white 
noise), it is in general not relevant to decision or control tasks. Extremely complex 
data, where all the complexity arises from nuisance factors, is useless for visual deci- 
sions, so one could say that such data is "uninformative." Vice- versa, there could be 
very simple data that are directly relevant to the decision. So, the complexity of the 
data itself is not a viable measure of the information content in the data/or the purpose 
of visual decisions. A natural image is not just a random collection of pixels; rather, 
an image is a sample from a distribution of (natural) images, p(I), not a distribution in 
itself. So, if we want to measure the "informative content" of an image, we have to do 
so relative to the scene. 

Setting aside technicalities that arise when the distributions are over continuous 
infinite-dimensional spaces, we define entropy formally as H(I) = E[logp(I)] where 
the expectation is with respect to p itself; that is, H(I) = J log p(I)dP(I) (see [44] 
for details). Entropy measures the "volume" of the distribution p, and is used as a 
measure of "uncertainty" or "information" about the random variable /. Other mea- 
sures of complexity, such as coding length [ ] or algorithmic complexity [108, 118], 
can be also related to entropy. The mutual information !(/;£)[ ] between the im- 
age and the scene is given by H(£) — H(£\I). It is the residual uncertainty on the 
scene £ when the image I is given. This would be a viable measure of the informa- 
tion content of the image, if one were able to calculate it. In a sense, this manuscript 
explores ways to compute such a mutual information. In Section 8.4.4 we will ex- 
plore the relation between Actionable Information and the conditional entropy of the 
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ACTIONABLE INFORMATION 



image given an estimated description of the scene. In particular, using the properties 
of mutual information, we have that I(£; I) = H(I) — H(I\£), and the latter denotes 
the uncertainty of the image given a description of the scene. This only depends on the 
sensor noise and other unmodeled phenomena, for all other nuisances are encoded in 
£, so it is not indicative of the informative content of the image. On the other hand, if 
most of the uncertainty in the image I is due to nuisance factors, the quantity H(I) is 
also not indicative of the informative content of the image. So, instead of H(I), what 
we want to measure is H((j) A (I)), that discounts the effects of the nuisances. This is 
called actionable information [172] and formalizes the notion of information proposed 
by Gibson [66] : 



H(I) = H{<P\I)). 



(2.20) 



COMPLETE INFORMATION 



Now, it is possible that the maximal invariant of the data </> A (I) contains no information 
at all about the object of inference, in the sense that the performance of the task (for 
instance a classification task) is the same with or without it. What would be most 
useful to perform the task would be a complete representation, £. Of all statistics of 
the complete representation we are interested in the smallest, so we could measure the 
complete information as the entropy of a minimal sufficient statistic of the complete 
representation. 



Hi = H(0 



(2.21) 



We will defer the issue of computing these quantities to Section 8, although one could 
already conjecture that 

H(I) < H v (2.22) 



It is important to notice that, whereas the scene £ consists of complex objects (shapes, 
reflectance functions) that live in infinite-dimensional spaces that do not admit simple 
base measures, let alone distributions of which we can easily compute the entropy, 
the representation £ is usually a finite-dimensional object supported on a zero-measure 
subset of the image, on which we can define a probability distribution, and compute 
entropy with standard tools. For instance, in [ ] it is shown that even when the 
images are thought of as surfaces (with infinite resolution) the representation is a tree 
with a finite number of nodes (for a bounded region of space), called Attributed Reeb 
Tree (ART). 



Example 1 It is interesting to notice that there are cases when H(I) = H%. This 
happens, for instance, when the only existing nuisances have the structure of a group. 
For instance, if contrast is the only nuisance, then the geometry of the level lines ( or 
equivalently the gradient direction) is a complete contrast invariant. Likewise, for 
viewpoint and contrast nuisances, the ART [177] is a complete feature. In general, 
however, this is not the case. 
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2.6 Optimality and the relation between features and 
the classifier 

Eliminating nuisances via marginalization or extremization yields optimal classifica- 
tion. Eliminating nuisances via the design of invariant features yields optimal classifi- 
cation only if the features have the structure of a group, i.e. v = 0. In this case, one can 
"pre-process" both the data and the training set to eliminate the effects of g, and design 
an equivalent statistical model (called an "invariant" model), and an equi-variant clas- 
sifier. In Section 4 we will show constructive ways of designing invariant features, via 
the use of co-variant detectors and their corresponding invariant descriptors. 

Optimality, as we have defined it in (2.2), does not impose restrictions on the set of 
classifiers, and the data processing inequality (2.14) stipulates that any pre-processing 
can at best keep the Bayesian risk constant, but not decrease it. In the presence of only 
invertible nuisances, v = 0, it is sensible to compute the maximal invariant to eliminate 
group nuisances, but non-invertible nuisances can only be eliminated without a loss at 
decision time, via marginalization or extremization. This puts all the burden on the 
classifier, that at decision time has to compute a complex integral (2.12), or solve a 
complex optimization problem (2.13). 

An alternative to this strategy is to constrain the choice of classifiers by limiting 
the processing to be performed at decision time. For instance, one could constrain 
the classifiers to nearest-neighbor rules, with respect to the distance between statis- NEAREST-NEIGHBOR 
tics computed on the test data (features) and statistics computed on the training data 
(templates). Two questions then arise naturally: What is the "best" template, if there is 
one? Of course, even choosing the best template, a feature-template nearest neighbor is 
not necessarily optimal. Therefore, the second question is: When is a template-based 
approach optimal? We address these questions next. 

2.6.1 Templates and "blurring" 

This section refers to a particular instantiation of visual decision problems, where the 
set of allowable classifiers is constrained to be on the form 

c = arg min d (/,/ c ) = - 0(/ c )|| (2.23) 

cG{0,l} 

for some statistic <p. Here I c is a function of the likelihood p(I\c) that can be pre- 
computed, and is called a template. If the likelihood is given in terms of samples template 
(training set) {Ik}^ =1 ~ p(I\c), then the template can be any statistic of the (training) training set 
data. In particular, a distance can be defined by designing features (j) that are invariant 
to the group component of the nuisance g G G. In this case, the space (f)(1) = X/G is 
the quotient of the set of images modulo the group component of the nuisance, which 
is in general not a linear space even when both X and G are linear. The distance above 
is a cordal distance, that does not respect the geometric structure of the quotient space 
X/G. A better choice would be to define a geodesic distance, or a distance between 
equivalence classes, d<3 ([</>(/)], [0(/ c )]). The structure of orbit spaces under the action 
of finite-dimensional groups is well understood [1 14] and therefore we will not belabor 
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this issue here (see Appendix A.l). Instead, we focus on the two critical questions: 
First, what is the "best" template I c , and how can it be computed from the training 
set? Second, are there situations or conditions under which this approach can yield 
the same performance of the Bayes or ML classifiers? (Section 3). In Section 6 we will 
show that one can build the equivalent of a template also for the test set, provided that 

SUFFICIENTLY EXCITING SAMPLE a "sufficiently exciting " test sample is available. 

The first thing to acknowledge is that the "best" template depends on the class 
of discriminants (or distance functions) one chooses. We will therefore answer the 
question for the simplest case of the squared Euclidean distance in the embedding 
(linear) space of images R NxM . More in general, the distance and its corresponding 
optimal template have to be co-designed [116, 115]. In all cases, one can choose as 
optimal template the one that induces the smallest expected distance for each class. 
For the case of the Euclidean distance we have 

J c = argmin^ (J | c) [||/-/ c || 2 ]= / \\I - I c \\ 2 dP(I\c) (2.24) 

that is solved by the conditional mean and approximated by the sample mean obtained 
from the training set 

Ic= I IdP(I\c)= Ik= E H9k,tk,"k). (2.25) 

X J *^( J I C ) 9k ~ dP(g) 

v k ~ dP(v) 
£k ~ dQ c (0 

Note that the distribution of the training samples, that in the integral above acts as 
an importance distribution, has to be "sufficiently exciting" [ ], in the sense that the 
training set must be a fair sample from p(I\c). If this is not the case, for instance in 
the trivial instance when all the training sample are identical, then the optimal template 
cannot be constructed. Different instantiations of this notation (corresponding to dif- 
ferent choices of groups G, scene representation 5, and nuisances v, often not explicit 
but latent in the algorithms) yield Geometric Blur [19], where the priors dP(g) are not 
learned but sampled in a neighborhood of the identity, and DAISY [31, 185], where 
instead of the intensity the template is comprised of quantized gradient orientation his- 
tograms. More in general, many models of early vision architectures include filtering 
steps, that mimic the quotienting operation to generate the invariant (j) A (I), followed by 
a pooling or averaging operation, akin to computing the template above [32]. Note also 
that the choice of norm, or more in general of classification rule, affects the form of the 
template. For instance, if instead of the £ 2 norm we consider the I 1 norm, the resulting 
template is the median, rather than the mean, of the sample distribution. Other statistics 
are possible, including the (possibly multiple) modes, or the entire sample distribution 
[115]. 

Note that the relationship between a template-based nearest-neighbor classifier and 
a classifier based on the proper likelihood is not straightforward, even if the class con- 
sists of a singleton - which means that it could be captured by a single "template" if 



2.6. OPTIMALITY AND THE RELATION BETWEEN FEATURES AND THE CLASSIFIER3 1 



there were no nuisances. The marginalized likelihood is 

exp (-||7 - HI) dP(g)dP{v) (2.26) 



/< 



assuming a normal density for the additive residual n, with covariance E, where || • ||s is 
a Mahalanobis norm, = v T T^~ 1 v\ the nearest-neighbor template-based classifier 
would instead try to maximize 



exp -||J 



j h(g,i,v)dP{g)dP(v)\A . (2.27) 



The quantity bracketed is called the blurred template I c 

Ic = J h(g,£,v)dP(v)dP(g) (2.28) 

which does not depend on the nuisance not because it has been marginalized or max- 
outed, but because it has been "blurred" or "smeared" all over the template. This 
strategy does not rest on sound decision- theoretic principles, and yet it is one of the 
most commonly used in a variety of classification domains, from nearest neighbor [19] 
to support vector machines (SVM) [37] to neural networks [ ], to boosting [195]. 
Note that if instead of computing distance in the embedding space || I— I\\x we compute 
the distance in the quotient, Z/Q,we do not need to blur out the group in the template. 
In fact, the expectation of the quantity 

exp(-d G (0(/),0(/))) (2.29) 

is minimized by 



Hi)- J 



<j>oh(^v)dP(v). (2.30) 



Note that we have implicitly assumed that <fi acts linearly on the space X, lest we would 
have to consider 0(1), rather than (j>{I). We discuss linear features in more detail in 
Section 4.2. 

As for how the priors can be learned from the data, we defer the answer to Section 
9 since it is not specific to the use of templates. For now, we note that - should a 
prior be available - it can be used according to (2.25) for the case of classifiers based 
on features/templates, (2.12) for the case of Bayesian and (2.13) for the case of ML 
classifiers. 

The second question, which is under what conditions this template-based approach 
is optimal, is somewhat more delicate and will be addressed in Section 3.4. The short 
answer is never, in the sense that if it were optimal, there would be no need for aver- 
aging. However, a template approach can be advocated when there are constraint on 
decision-time, and at least some of the nuisances can be eliminated via canonization, 
as described in Section 3, and the residual uncertainty is described by a uni-modal 
class-conditional density that is well captured by its mode or mean (2.30). 



BLURRED TEMPLATE 
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Figure 2.5: Complex nuisances (quantization, contrast, partial occlusions) can be so 
severe as to render a (single) template, I c above, largely indiscriminative. Marginal- 
ization may enable recognition, especially if guided by priors. In some cases, even that 
does not work. However, extremization of the nuisances given the class can still be per- 
formed from (2.13). For instance, given the class c = "dalmatian" (left) and c = "back 
of a nude" (right), one can easily determine viewpoint (pose) g, and occlusions (loca- 
tion) v from (2.13). 



When the template is computed on the test data, it can be used as a "descriptor' \ 
that is, an invariant feature. More general descriptors will be discussed in the next 
chapter, and how to construct them from data will be described in Section 6. 

Remark 1 (On the use of the word "template") The term template refers to a large 
variety of approaches, only some of which are captured by the definition we have given 
in this chapter. In particular, Deformable Templates, studied in depth by Grenander 
and coworkers [71 ], are based on premises that are not valid in our setting. In fact, in 
Deformable Templates there is an underlying hidden variable, the "template " ( which 
could be given or learned), that is acted upon by a "deformation " ( typically an infinite- 
dimensional group of diffeomorphisms). However, in Grenander 's approach the group 
acts transitively on the template, meaning that from the template one can "reach" any 
object of interest. In other words, there is a single orbit that covers the entire space of 
the measurements. In this case, all the "information " is contained in the deformation 
(the group), that is therefore not a nuisance. 
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Chapter 3 

Canonized Features 



Invariant features can be designed in a number of ways. In this section we describe a 
constructive approach called canonization} that leverages on the notion of co-variant 
detector and its associated invariant descriptor. The basic idea is that a group G act- 
ing on a space S organizes it into orbits, [£] = {g£ V g G G} each orbit being an 
equivalence class (reflexive, symmetric, transitive) representable with any one ele- EQUIVALENCE CLASS 
ment along the orbit. Of all possible choices of representatives, we are looking for 
one that is canonical, in the sense that it can be determined uniquely and consistently 
for each orbit. This corresponds to cutting a section (or base) of the orbit space. All 
considerations (defining a base measure, distributions, discriminant functions) can be 
restricted to the base, which is now independent of the group G and effectively rep- 
resents the quotient space X/G. Alternatively, one can use the entire orbit [£] as an 
invariant representation, and then define distances and discriminant functions among 
orbits, for instance via max-out, [£2]) = mm #i,# 2 eG ^G7i£i> 92&)- 

The name of the game in canonization is to design a functional - called feature de- 
tector - that chooses a canonical representative for a certain nuisance g that is insensi- 
tive to (ideally independent of) other nuisances. We will discuss the issue of interaction 
of nuisances in canonization in Section 3.4. Before doing so, however, we recall some 
nomenclature. 

Definition 1 (Invariant Feature) A feature cj> : X — )> R K is any deterministic function 
of the data taking values in some vector space, I \-> 4>{I). Considering the formal 
generative model (2.5), a feature is G-invariant if 

0o%,e,z,) = 0o/z(e,e,z/), MgeG (3.1) 

and for all in the appropriate spaces, where e G G is the identity transformation. 

lr The name "canonization" comes from the fact that a co-variant detector determines a canonical frame, 
or a canonical element of the group. In an ecclesiastic context, canonization is the elevation of an individual 
to one of the steps to the ladder of sanctitude. Similarly, a co- variant detector has the authority to elevate 
a group element to be "special" in the sense of determining the frame around which the data is described. 
However, this must be followed by additional steps, such as commutativity and proper sampling, that are 
discussed in future chapters. 
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In other words, an invariant feature is a function of the data that does not depend on 
the nuisance. Note that we are focusing on the group component of the nuisance, for 
reasons explained in the previous chapter and further elaborated in Section 3.4. We 
recall the definition of representation from Section 2.4.4: 

Definition 2 (Representation) Given a collection of images {I}, a feature £ G S is a 
representation if{I} G £(£), where £(£) = {h(g, £, z/), # G G, ^ G V}. Equivalently, 
£ G £ _1 ({/}). Given <2 sce^e £, a representation £ is complete if i* satisfies the 
compatibility condition £(£) = a minimal complete representation is a minimal 
sufficient statistic of C(£). 

A representation is three things at once: It is a statistic, that is a function of the images. 
However, it is embedded in the space of scenes, so it can be thought of as a scene 
itself. For instance, given a single image, under the Lambert- Ambient- Static model 
(2.6) of Section 2.2, one can construct a representation that is a plane (shape S) with 
the image I glued onto it, so p = I. Finally, the representation is a finite-complexity 
structure that can be stored in the memory of a digital computer. Even if the data had 
infinite complexity (infinite-resolution images), the analysis of [177] suggests that the 
representation occupies an infinitesimal volume in the space of scenes. We call it a 
"feature," even though it lives in the embedding space of the scene, because, as we will 
show in Section 8, it can be computed from data. 

A minimal complete representation, which we refer to as a "representation" without 
additional qualifications when clear from the context, would be the ideal feature, in 
the sense that it captures everything about the scene that can be gathered from the data 
except for the effect of the nuisances. When non-invertible nuisances are absent, v = 0, 
a representation can be used as a representative of the orbits (equivalence classes) [£] : 



Clearly the absence of non-invertible nuisances is a wild idealization, that is made here 
just as a pedagogical expedient. The case where other nuisances are present, v ^ 0, 
requires some attention and will not be fully addressed until after Section 3.4. In the 
next section, however, we pause to elaborate on the notion of representation and what 
it entails. Then, we study the groups G for which complete features exist (Section 3.4). 



If there were only invertible nuisances, there would be no need for a notion of repre- 
sentation, since the equivalence class [h(g, £, 0)] is a complete feature and it can be 
inferred from a single datum (one image). To begin understanding the notion of repre- 
sentation in the presence of non-invertible nuisances, we need to go back to the image 
formation model (2.6), and in particular to the pre-image of a point x G D on the 
domain of an image (2.7): 



£ = 4> A (%,£,0))~/i(e,£,0). 



(3.2) 



3.1 Hallucination an representation 



n s 1 (x) = {pGS\Tr(p) = x}. 



(3.3) 



3.1. HALLUCINATION AN REPRESENTATION 
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Figure 3.1: A representation enables the synthesis of an arbitrary number of images, and an 
image is compatible with infinitely many representations. Kanisza's triangle (top) is an image 
that is compatible with multiple representations (e.g. the two scenes on the bottom). There is 
no way from an image alone or without further exploration, to ascertain which representation 
is "correct." A unique interpretation can be forced by imposing priors or other model selection 
criteria (e.g. minimum description), but the process is problematic from an epistemological 
stance. On the other hand, if one is allowed to gather more data, for instance as part of an 
exploration process as described in Section 8, the set of representations that is compatible with 
the data shrinks. For instance, of all the representations compatible with the top image, only a 
subset will also be compatible with the bottom-right, and another subset with be compatible with 
the bottom right. While it is not possible to say, even asymptotically with an infinite amount 
of data, whether the representation approaches in some sense the ''true" scene, it is possible 
to validate whether the representation is compatible with the true scene in the sense of X(£) = 
In the case above, given the top image and one of the bottom two, it is possible to determine 
whether the representation is compatible with two triangles and three discs or three pac-man 
figures and three wedges. 



We note that this set may contain multiple elements, and in particular all points that lie 
on the same projection ray x: 

ns 1 (x) = {xZ t (x)eS}?j* ) . (3.4) 

We recall from (2.7) that we sort the depths Z{ in increasing order, Z\{x) < Z<i{x) < 
• • • < Zn(x), and when we want to restrict the pre-image to the point of first intersec- 
tion, Z(x) = Zi(x), we indicate the (unique) pre-image via tt~ 1 (x) = xZ(x) where 
Z(x) = Z\(x) (i.e. , we forgo the subscript S, even though the pre-image of course 
does depend on the shape of the scene S). Note that both tt~ 1 (x) and tt^ 1 (x) de- 
pend on the viewpoint g and on the geometry and topology of the scene. They depend 



36 



CHAPTER 3. CANONIZED FEATURES 



DETACHED OBJECTS 



IDEAL IMAGE 



on how many simply connected components 2 Si there are ("detached objects"), how 
many holes, openings, folds, occlusions etc. Note that we have sorted the depths while 
allowing the possibility of multiple points at the same depth. This happens in the limit 
when x approaches 3 an occluding boundary. 

Remark 2 (The "ideal image") In Section 2.4.4 we have hinted at the fact that the 
scene itself may not be a viable representation of the images that it has generated, 
which may appear strange at first. Indeed, the "true" scene may never be known, 
because it exists at a level of granularity that is finer than any instrument we have 
available to measure it. So, we only have access to the "true" scene via the data 
formation process (visual or otherwise). This is unlike a representation £, that can be 
arbitrarily manipulated in order to generate any image I G £(£). 

In the presence of occlusions, evaluating the (hypothetical) "ideal image" h(e, £, 0), 
that is the image that would be obtained if there were no nuisances, requires the "inver- 
sion " of the occlusion process, including a description of the scene both in the visible 
and in the non-visible portions. In formulas, from (2.6) and (2.7), we have 



h(e,t,0)=p(TTs 1 (D)). 



(3.5) 



Note that what this object returns is not the shape S of the scene, but the radiance of the 
scene in both the visible and the occluded portions of the scene. So, in a sense, it is an 
image of a "semi-transparent world" made of many simply connected surfaces. Note, 
however, that the 3-D geometry may be recovered if we can control the data acquisition 
process via h(g,£,0), with g = g(u) as described in Chapter 8, because by doing so we 
can generate the collection of all possible occluding boundaries, which is equivalent to 
an approximation of the 3-D geometry of the scene [210]. In any case, the hypothetical 
image h(e, £, 0) captures the topology of the world, reflected in the number of layers N 
present in the pre-image. While the notion of ideal image is awkward, and we will soon 
move beyond it, we need to consider it for a little while longer to ascertain what would 
be possible (or may be possible, depending on the sensing modality) if there were only 
invertible nuisances. 



As we have anticipated in Section 2.4.4, building a representation from a single 
image would not be very useful, because in general the space of all scenes S is much 
larger than the space of all images X, and lifting the image onto the scene yields no 
practical benefit. However, one may be able to construct a representation £ that is 
compatible with a collection of images, {I k }^ =1 , m me sense that 

h = Kg k ,£,v k ) Irifc- h(g k ,€,C> k ) (3.6) 

for all k = 1,...,K and for some g k ,0 k , but the same £. In fact, once we have an 
hallucinated scene £, we can produce infinitely many "virtual" images, that we can 
then compare with any real image that is actually measured. In fact, £(£) produces the 

2 A detached object is a simply connected component Si of the scene. The scene is made of multiple 
components, S = U^, as defined in Section 2.2. 

3 Note that occluding boundaries depend on the vantage point, which is usually treated as a nuisance g. 



3.2. OPTIMAL CLASSIFICATION WITH CANONIZATION 



37 



set of all possible hallucinated images of the scene £, and is therefore equivalent to its 
light field 4 [ ] . While the hypothesis £ = £ cannot be tested (we do not even have a 
metric in the space 5), the hypothesis h(g 1 ^ 1 v) ~ ft (<?,£,£) for some g, z> can be very 
simply tested by comparing two images, which is a task that poses no philosophical or 
mathematical difficulties. In compact notation, what we can test is £ G C~ l (£(£)). 
We will return to this issue in Sections 5.1 and 8. 

So far we have described the hallucination process, by which a representation can 
be used to generate images. The inverse process of hallucination is exploration, by EXPLORATION 
which one can use images to construct a representation. In Section 8 we will show 
how to perform this limit operation. Exactly how the representation £, built through 
exploration, is related to the actual "real" scene £, for instance whether they are "close" 
in some sense, hinges on the characterization of visual ambiguities, which we discuss 
in Appendix B.3. 

We will come back to the notion of representation after establishing what could be 
done if only invertible nuisances were present, and establishing exactly what nuisances 
are indeed invertible. 

3.2 Optimal classification with canonization 

In this section and the next we elaborate on what would be possible if all nuisances 
were invertible. 5 So, we assume that v = 0, and focus on the role of g. We will 
address v ^ in Section 3.4. For now, we are interested in how to design invariant and 
sufficient statistics for g alone, as if all other nuisances were absent. 

One of the many possible ways of designing an invariant feature is to use the data 
I to "fix" a particular group element g(I), and then "undo" it from the data. If the data 
does not allow fixing a group element g, it means it is already invariant to G. So, we 
define a (co-variant) feature detector to be a functional designed to choose a particular 
group action g, from which we can easily design an invariant feature, often referred to 
as an invariant (feature) descriptor. Note that both the detector and the descriptor are 
deterministic functions of the data, hence both are features. 

4 A representation enables generating images with any nuisance combinations. In this sense it is "equiv- 
alent" to the light field, because it can generate any image that a sampling of the light field would produce. 
How to store this representation, and how it is compatible with the computational architecture of the pri- 
mate's brain, is well outside the scope of this manuscript. However, others have tackled this question, most 
notably Koenderink and van Doom, who have studied in depth the structure of the representation (even 
though it is not called a representation) and its relation to the light field, and also coined the phrase "images 
as controlled hallucinations." We refer the reader to their work ([ ] and references therein) for further 
investigations on this topic. In our case, we will give a constructive example on how to build and store a 
representation through the exploration process in Section 8. 

5 Note that this situation is hypothetical, for even a group g (e.g. vantage point) acting on the scene (£) can 
generate non-invertible nuisances (self-occlusions v). According to the Ambient-Lambert model, a group 
transformation gt G SE(3) due to a change of vantage point induces on the image domain an epipolar 
transformation that, depending on the shape of the scene, can be arbitrarily complex. In particular, such a 
transformation need not be a function in the sense that self-occlusion will cause the pre-image of a point p 
via the domain deformation w~ 1 (x) to have multiple values. This issue will be resolved in later sections; for 
now, we assume that the scene is such that a group nuisance does not induce non-invertible transformations 
of the image domain. In particular, it has been shown in [177] that - in the absence of occlusion phenomena 
- the closure of epipolar domain deformations has as closure the entire group of planar diffeomorphisms. 
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We consider the set of digital images X to be (piece-wise constant) functions I : 
R 2 R 2 ; x G B € (xij) ' — y Iij, that can be identified with the set of matrices R NxM . 
A differentiate functional i/j : X x G —> R; h-> i/)(I,g) is said to be foca/, 

with effective support a if its value at g only depends on a neighborhood of the image 
of size a > 0, up to a residual that is smaller than the mean quantization error. For 
instance, for a translational frame q, if we call ii , , an image that is identical to 
/ in a neighborhood of size a centered at position g = T, and zero otherwise, then 
^^B^g)^) = i/>(I,g) +n> with \h\ < j^Y^ij \ n ij\- For instance, afunctional 
that evaluates the image at a pixel g = T = x G B e (xij), is local with effective 
support e. For groups other than translation, we consider the image in the reference 
frame determined by g, or equivalently consider the "transformed image" / o g~ l , in a 
neighborhood of the origin, so ip(I,g) = ^(/o^ _1 ,e). 

If we call Vip = the gradient of the functional ip with respect to (any) parametriza- 
tion of the group, 6 then under certain (so-called "transversality") conditions on ip, the 
equation Vip = locally determines g a function ofI,g = g(I), via the Implicit Func- 
tion Theorem. Such conditions are independent of the parametrization and consist of 
the Hessian matrix Hfy) = VV^ (a.k.a. the Jacobian of V0) being non-singular, 
det (H(ip)) 7^ 0. The function g is unique in a neighborhood where the transversality 
condition is satisfied, and is called a (local) canonical representative of the group. If the 
canonical representative co-varies with the group, in the sense that g(Iog) = (gog) (/), 
then the functional ip is called a co-variant detector. Each co-variant detector deter- 
mines a local reference frame so that, if the image is transformed by the action of the 
group, a hypothetical observer attached to the co-variant frame (i.e., a "Lagrangian" 
observer) would see no changes. We summarize this in the following definition: 

Definition 3 (Co-variant detector) A differentiable functional ij) : XxG — » R; (I,g) \-> 
^(J, g) is a co-variant detector if 

1. The equation det (H(ip(I, g))) = locally determines a unique isolated ex- 
tremum in the frame g G G, and 

2. if\7?p(I, g) = 0, then °g^g°g) = 0\/g^G, i.e. , ip co-varies with G. 

The notation log indicates the map (J, g) — (ft(e, £, 0), g) \-> h(g, ^,0) = I o g. This 
may seem a little confusing at this point, but it will be clarified in Sect. 3.4.1. The 
first "transversality" condition [73] corresponds to the Jacobian of with respect to 
g being non-singular: 

\Jv^\^0. (3.7) 

In words, a co- variant detector is a function that determines an isolated group element 
in such a way that, if we transform the image, the group elements is transformed in the 
same manner. Examples of co- variant detectors will follow shortly. 

Definition 4 (Canonizability) We say that the image I is G-canonizable ( is canoniz- 
able with respect to the group G), and g G G is the canonical element, if there exists a 
covariant detector ip such that V^(^, g) = 0. 



6 The following discussion is restricted to finite-dimensional groups, but it could be extended with some 
effort to infinite-dimensional ones. 
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Note that, depending on the functional ip, the canonical element may be defined 
only locally, i.e. the functional may only depend on I(x)\ xeBGD , that is a restriction 
of the image to a subset B of its domain D. In the latter case we say that I is locally 
canonizable, or, with an abuse of nomenclature, we say that the region B is canonizable. 



The transversality condition (3.7) guarantees that g, the canonical element, is an iso- 
lated (Morse) critical point [136] of the derivative of the function ip via the Implicit 
Function Theorem [73]. So a co-variant detector is a statistic (a feature) that "extracts" 
a group element g. With a co- variant detector we can easily construct an invariant de- 
scriptor, or local invariant feature, by considering the data itself in the reference frame 
determined by the detector: 

Definition 5 (Canonized descriptor) For a given co-variant detector ip that fixes a 
canonical element g via Vip(I, g(I)) = we call the statistic 



<f>(I) = Iog-\l) | W(/,s(/)) = 0. 



(3.8) 



an invariant descriptor. 



7 Since canonization is possible only for groups acting on the domain of the data (i.e. , images), in the 
presence of more complex nuisances one has to exercise caution in that the group g acting on the scene may 
induce a different transformation on the domain of the image, which may not even be a group. For instance, 
a spatial translation along a direction parallel to the image plane does not induce a translation of the image 
plane, unless the scene is planar and fronto-parallel. 



LOCAL INVARIANT FEATURE 



A trivial example of canonical detector and its corresponding descriptor can be de- 
signed for the translation group. Consider for instance the detector ip that finds the 
brightest pixel in the image, and assigns its coordinates to (0, 0). Relative to this ori- 
gin, the image is translation-invariant, because as we translate the image, so does the 
brightest pixel, and in the moving frame the image does not change as we translate. 
Similarly, we can assign the value of the brightest pixel to 1, the value of the darkest 
pixel to 0, linearly interpolate pixel values in between, and we have an affine contrast- 
invariant detector. 

As far as eliminating the effects of a group, all covariant detectors are equivalent. 7 
Where they differ is in how they behave relative to all other nuisances. Later we will 
give more examples of detectors that are designed to "behave well" with respect to 
other nuisances. In the meantime, however, we state more precisely the fact that, as far 
as dealing with a group nuisance, all co-variant detectors do the job. 

Theorem 1 (Canonized descriptors are complete features) Let ip be a co-variant de- 
tector. Then the corresponding canonized descriptor (3.8) is an invariant sufficient 
statistic. 

Proof: To show that the descriptor is invariant we must show that <p{I o g) = (p{I). 
But<t>{Iog) = (Iog)og-\log) =Iogo(gg)- 1 = I o g o g' 1 g-\l) = I o g' 1 (I) . 
To show that it is complete it suffices to show that it spans the orbit space T/G, which 
is evident from the definition (p(I) = I o g~ x . 
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The notation log is slightly ambiguous at this point, but will be clarified in Sect. 3.4.1. 

Note that invertible nuisances g may act on the domain of the image (e.g. affine trans- 
formations due to viewpoint changes), as well as on its range (e.g. contrast transforma- 
tions due to illumination changes). 

Example 2 (SIFT detector and its variants) To construct a simple translation- covariant 
detector, consider an isotropic bi-variate Gaussian function 

M(x\ /i, a 2 ) = exp( — )> then for any given scale a, the Laplacian-of- 

Gaussian (LoG) ^>(J, g) = V 2 A/"(x; g, a 2 ) * I(x) is a linear translation- covariant de- 
tector. If the group includes both location and scale, so g = (g, a 2 ), then the same func- 
tional can be used as a translation-scale detector. Other examples are the difference- 
of-Gaussians (DoG) ip(I,g) = ^fe^ )-N\x,g,k a ) ^ j^ x ^ w f tn typically k = 1.6, 
and the Hessian- of- Gaussian (HoG) is ijj(I,g) = det H(Af(x; g, cr 2 )). Among the 
most popular detectors, SIFT uses the DoG, as an approximation of the Laplacian. 

Example 3 (Harris' corner and its variants) Harris' corner and its variants (Stephens, 
Lucas -Kanade, etc.) replace the Hessian with the second-moment matrix: 

i/j(I,g) = det ( / V T IVI(x)dx J . (3.9) 

\JB.ig) J 

One can obtain generalizations to groups other than translation in a straightforward 
manner by replacing J\f(x; g,cr 2 ) with 27rcr det j ex P(~ 2 H-2 det(J^) 2 ) wnere Jg ^ tne 
cobian of the group. For instance, for the affine group g(x) = Ax + b, we have that 

i>{I,g) = V 2 ( 27ra det a ex P(~ ^^deTpp )) is an offine-covariant (Laplacian) de- 
tector. One can similarly obtain a Hessian detector or a DoG detector. The Euclidean 
group has A G SO (2), so that det A = 1, and the similarity group has a A, with 
determinant a. 

As we will see in Example 6, this functional has some limitations in that it is not 
a linear functional, and therefore it does not commute with additive nuisances such as 
quantization or noise. 

Example 4 (Harris- Affine) The only difference from the standard Harris 9 corner is 
that the region where the second-moment matrix is aggregated is not a spherical neigh- 
borhood of location g with radius a, but instead an ellipsoid represented by a location 
T G D C M 2 and an invertible 2x2 matrix A G GL(2). In this case, g = (T, A) is the 
affine group, and the second-moment matrix is computed by considering the gradient 
with respect to all 6 parameters T, A, so the second-moment matrix is 6x6. However, 
the general form of the functional is identical to (3.9), and shares the same limitations, 
as we will see, in that affine -covariant canonization necessarily entails a loss. 

Although these detectors are not local, their effective support can be considered to 
be a spherical neighborhood of radius a multiple of the standard deviation a > 0, so 
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they are commonly treated as local. Varying the scale parameter a produces a scale- 
space, whereby the locus of extrema of ip describes a graph in R 3 , via (x, a) \-> x = 
g(I; a). Starting from the finest scale (smallest a), one will have a large number of 
extrema; as a increases, extrema will merge or disappear. Although in two-dimensional 
scale space extrema can also appear as well as split, such genetic effects (births and 
bifurcations) have been shown to be increasingly rare as scale increases, so the locus 
of extrema as a function of scale is well approximated by a tree, which we call the 
co-variant detection tree [116]. 

Remark 3 (Canonizability, saliency and the "sketch") The notion of canonizability 
is related to the notion of "sketchability" defined in [74], although the latter is in- 
troduced without direct ties to a specific task, and motivated in [126]. In the case 
of this manuscript, the notion of canonizability arises naturally from visual classifi- 
cation tasks, in the sense of providing a vehicle to design an invariant descriptor. It 
also relates to the notion of "saliency" defined in [189, 85], although again the latter 
stems from resource limitations (foveal sensing) rather than from a direct tie to a visual 
classification task. 

Remark 4 (Invariant descriptors without co- variant detectors) 

The assumption of differentiability in a co-variant detector can be easily lifted; in Sec- 
tion 4.3 we will show how to construct co-variant detectors that are not differentiable. 
Indeed, canonization itself is not necessary to design invariant descriptors. We have 
already mentioned "blurring" as a way to reduce (if not eliminate) the dependency 
of a statistic on a group, although that does not yield a sufficient statistic. However, 
even for designing complete (i.e. invariant and sufficient) features, canonization is not 
necessary. For instance, the geometry of the level curves - or its dual, the gradient 
direction - is a complete contrast-invariant which does not require a contrast-detector. 
Indeed, even the first condition in the definition of a co-variant detector is not necessary 
in order to define an invariant descriptor: Assume that the image I is such that for any 
functional ip, the equation Vip(I, g) — does not uniquely determine g = g(I). That 
means that | Jy^ | = Ofor all ip, and therefore all statistics are already (locally) invari- 
ant to G. More in general, where the structure of the image allows a "stable" and "re- 
peatable" detection 8 of a frame g, this can be inverted and canonized (f)(1) = I o g~ x . 
Where the image does not enable the detection of a frame g, it means that the image it- 
self is already invariant to G. These statements will be elaborated in Section 3.5 where 
we introduce the notion of texture. 

Intuitively, if a covariant detector is "unstable", i.e. , |Jy^| — 0, then any function 
<p(I) is "insensitive" to g, in the sense that, assuming ip o I to be smooth, we have 
cp(I) ~ cp(I o g). This means that what we cannot canonize does not matter towards 
the goal of designing invariant (insensitive) statistics; it is already invariant (insensi- 
tive). Of course, these statistics will not be localized. In particular, the definition of 
canonizability, and its requirement that g be an isolated critical point, would appear 
to exclude edges and ridges, and in general co-dimension one critical loci that are not 
Morse critical points. However, this is not the case, because the definition of critical 

8 "Stability" will be captured by the notion of Structural Stability, and "repeatability" by the notion of 
Proper Sampling. 
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point depends on the group G, which can include discrete groups ( thus capturing the 
notion of "periodic structures," or "regular texture") and sub-groups of the ordinary 
translation group, for instance planar translation along a given direction, capturing 
the notion of "edge" or "ridge" in the orthogonal direction. We will come back to the 
notion of stability in Section 4.1. 

Remark 5 (Dense descriptors) We emphasize that detectors ' only purpose is to avoid 
marginalizing the invertible component of the group G. However, at best such detec- 
tors can yield no improvement over marginalizing the action of G, that is to use no 
detector at all (Section 2.4.1). Therefore, one should always marginalize or max-out 
the nuisances if this process is viable given resource constraints such as the need to 
minimize processing at decision time. This is a design choice that has been explored 
empirically. In visual category recognition, some researchers prefer to use features 
selected around "keypoints," whereas others prefer to compute "dense descriptors" at 
each pixel, or at a regular sub-sampling of the pixel grid, and let the classifier sort out 
which are informative, at decision time. 

Remark 6 (Preview: Aliasing and Proper Sampling) Canonizability, as we have de- 
fined it, entails the computation of the Jacobian, which is a differential operation on 
the image. However, images are discrete, merely a sampled version of the underlying 
signal (assuming that is piecewise differentiable), that is the radiance of the scene. In 
any case, the differentiable approximation, or the computation of the Jacobian, entails 
a choice of scale, depending on which any given "structure " may or may not exist: A 
differential operator such as the Jacobian could be invertible at a certain scale, and 
not invertible at a different scale at the same location. Because the "true" scale is 
unknown (and it could be argued that it does not exist), canonizability alone is not 
sufficient to determine whether a region can be meaningfully canonized. "Meaning- 
ful" in this context indicates that a structure detected in an image corresponds to some 
structure in the scene, and is not instead a sampling artifact ("aliasing") due to the 
image formation process, for instance quantization and noise. Therefore, an additional 
condition must be satisfied for a region to be "meaningfully" canonized. This is the 
condition o/Proper Sampling that we will introduce in Section 5.1. 

3.3 Optimality of feature-based classification 

The use of canonization to design invariant descriptors requires the image to support 
"reliable" (in the sense of Definition 3) co-variant detection. As we have discussed in 
Remark 4, the challenge in canonization is not when the co- variant detector is unreli- 
able, for that implies the image is already "insensitive" to the action of G. Instead, the 
challenge is when the covariant detector reliably detects the wrong canonical element 
g, for instance where there are multiple repeated structures that are locally indistin- 
guishable, as is often the case in cluttered scenes. We will come back to this issue in 
Section 5.2. 

The good news is that, when canonization works, it simplifies visual classification 
by eliminating the group nuisance without any loss of performance. 



3.4. INTERACTION OF NUISANCES IN CANONIZATION: WHAT IS REALLY CANONIZABLE743 



Theorem 2 (Invariant classification) If a complete G -invariant descriptor £ = <j>{I) 
can be constructed from the data /, it is possible to construct a classifier based on the 
class- conditional distribution dP(£\c) that attains the same minimum (Bayesian) risk 
as the original likelihood p (I \c). 

The proof follows from the definitions and Theorem 7.4 on page 269 of [159]. The 
classifier based on the complete invariant descriptor is called equi-variant. 

Based on this result, one would surmise that it is always desirable to canonize group 
nuisances, whether or not they come with a prior dP(g) . 9 An important caveat is that, 
so far in this section, we have assumed that the non-invertible nuisance is absent, i.e. 
v — 0, or that, more generally, the canonization procedure for g is independent of v, 
or "commutes" with v, in a sense that we will make precise in Definition. 6. This is 
true for some nuisances, but not for others, even if they have the structure of a group, 
as we will see in the next section. There we will show that many group nuisances 
indeed cannot be canonized. These include the affine and projective group (scale, skew, 
deformation) as well as more complex nuisances that have to be dealt with either by 
marginalization (2.12), or by extremization (2.13). 



3.4 Interaction of nuisances in canonization: what is 
really canonizable? 

The previous section described canonization of the group nuisance g G G in the ab- 
sence of other nuisances v = 0. Unfortunately, some nuisances are clearly not invert- 
ible (occlusions, quantization, additive noise), and therefore they cannot be canonized. 
What is worse, even group nuisances may lose their invertibility once composed with 
non-invertible nuisances. 

In this section, we deal with the interaction between invertible and non-invertible 
nuisances, so we relax the condition v — and describe feature detectors that "com- 
mute" with v. We show that the only subgroup of G that has this property is the 
isometric group of the plane. That is, planar rotations, translations and reflections. ISOMETRIC GROUP 
Other nuisances, groups or not, have to be dealt with by marginalization or extremiza- 
tion if one wishes to retain optimal performance. This includes the similarity group 
of rotations translations and scale, that is instead canonized in [ ], and the affine SIMILARITY GROUP 
group, that is instead canonized in [134]. AFFINE GROUP 

In order to simplify the derivation, we introduce the following notation, in part 
already adopted earlier in the proof of Theorem 1 . The notation will be fully clarified in 
Sect. 3.4.1. If / is the "ideal image" (without nuisances, see Remark 2), / = /i(e, £, 0), ideal image 
then 

log = %,e,0) (3.10) 
lov = fc(e,f,i/). (3.11) 

9 This may seem confusing if one consider classification of hand-written digit modulo planar rotation. 
However, in the case of digits such as "6" and "9", the class identity depends on the group, which is therefore 
not a nuisance and should instead be part of the description £. 
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The operators (• o g) and (• o v) can also be composed, log o v = h(g, £, z/) and applied 
to an arbitrary image; for instance, if / = h(g, £, v), then for any other g, v we have 

Iogov = h(gg^,v(Bis) (3.12) 

where © is a suitable composition operator that depends on the space where the nui- 
sance v is defined. Note that, in general, the action of the group and the other nuisances 
do not commute: logov ^ lovog. When this happens we say that the group commutes 

commutative nuisance with the (non-group) nuisance: 

Definition 6 (Commutative nuisance) A group nuisance g G G commutes with a 
(non-group) nuisance v if 

Iogov = Iovog. (3.13) 

Note that commutativity does not coincide with invertibility: A nuisance can be invert- 
ible, and yet not commutative (e.g. the scaling group does not commute with quantiza- 
tion). 

For a nuisance to be canonizable without a loss, (i. e. , eliminated via pre-processing, 
or via a complete invariant feature) it not only has to be a group, but it also has to com- 
mute with the other nuisances. In the following we show that the only nuisances that 
commute with quantization are the isometric group of the plane. While it is common, 
following the literature on scale selection, to canonize it, scale is not canonizable with- 
out a loss, so the selection of a single representative scale is not advisable. 10 Instead, a 
description of a region of an image at all scales should be considered, since scale, in a 
quantized domain, is a semi-group, rather than a group. 



Interaction of group nuisances with quantization 

Note that, per Theorem 2, only for canonizable nuisances can we design an equi-variant 
classifier via a co- variant detector and invariant descriptor. All other nuisances should 
be handled via marginalization or extremization in order to retain optimality (minimum 
risk), or via a template if one is willing to sacrifice optimality in favor of speed at 
decision time. 

Theorem 3 (What to canonize) The only nuisance that commutes with quantization 
is the isometric group of the plane, that is the group of rotations, translations and 
reflections. 

Proof: We want to characterize the group g such that Iogov = Iovog where v is 
quantization. For a quantization scale a, we have the measured intensity (irradiance) 
at a pixel xi 

Iov( Xi )= \ I(x)dx= / xs tr (x i )(x)I(x)dx= / XB tr (o)(x-x i )I(x)dx 

<" 4 > 

10 Even rotations are technically not invertible, because of the rectangular shape of the pixels; however, 
this is a second-order effect that can be neglected for practical purposes. 
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where B a (x) is a ball of radius a centered at x, x is a characteristic function that is 
written more generally as a kernel Q(x\ a), for instance the Gaussian kernel Q(x\ a) = 
J\f(x] 0, a 2 ), allowing the possibility of more general quantization or sampling schemes, 
including soft binning based on a partition of unity of D rather than simple functions 
X- Now, we have 

(I o v) o g{pa) = (^J Q(x - x i ]a)I(x)dx S j o g = J Q{x - gxi\ o)I{x)dx (3.15) 
whereas, with a change of variable x' = gx, we have 
(I o g) o u(xi) = / Q(x - Xi\o)I{gx)dx = / Q{g~ x {x f - gxi)\ o)I{x')\ J g \dx' 




(3.16) 

where \ J g \ is the determinant of the Jacobian of the group G computed at g, so that the 
change of measure is dx' = \J 9 \dx. From this it can be seen that the group nuisance 
commutes with quantization if and only if 



(3.17) 



That is, the quantization kernel has to be G-invariant, Q(x;a) — Q(gx;a), and the 
group G has to be an isometry. The only isometry of the plane is the set of planar 
rotations and translations (the Special Euclidean group SE(2)) and reflections. The 
set of is ome tries of the plane is often indicated by E{2). 

Corollary 1 (Do not canonize scale (nor the affine group)) The affine group does not 
commute with quantization, and in particular the scaling and skew sub-groups. As im- 
mediate consequence, neither do the more general projective group and the group of 
general diffeomorphisms of the plane. Therefore, scale should not be (globally) canon- 
ized and the scaling sub-group should instead be sampled. We will revisit this issue in 
Section 5.2 where we introduce the selection tree. 

So, although [172] suggests that invariant sufficient statistics can be devised for general 
viewpoint changes, this is only theoretically valid in the limit when there is no quan- 
tization and the data is available at infinite resolution. In the presence of quantization, 
canonization of anything more than the isometric group is not advisable. Because the 
composition of scale and quantization is a semi-group, the entire semi-orbit - or a dis- 
cretization of it - should be retained. This corresponds to a sampling of scales, rather 
than a scale selection procedure. 

The additive residual n(x) does not pose a significant problem in the context of 
quantization since it is assumed to be spatially stationary and white/zero-mean, so 
quantization actually reduces the noise level: 

n(xi) = / n(x)dx 0. (3.18) 
Instead, the other important nuisance is occlusion. 
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Interaction of group nuisances with occlusion: Local invariant features 

Planar rotations do not generate occlusions, so canonization of rotation is, at least to 
first approximation, unaffected by occlusion. Translation, however, is. In particular, 
parallax if planar translations are due to parallax, consisting of a translation of the optical 

FRONTO-PARALLEL center in front of a scene that is not flat and fronto-parallel, there will be occlusions 

unless the scene is concave. Therefore, translation cannot be globally canonized either. 
This means that one should consider the set of all possible translations and defer the 
treatment of the nuisance to training or testing. This is indeed an approach that has 
recently taken hold in the recognition literature [46, 188]. 

However, owing to the statistics of natural images [138, 144], the response of any 
translation-co-variant detector will be neither a unique global extremum nor the entire 
lattice A = D n Z 2 , but instead a discrete set of (multiple, isolated) critical points 
Ti G A. Therefore, one could hypothesize that each of them is a viable canonical rep- 
resentative, and then test the hypothesis at decision-time. This means that one should 
construct multiple descriptors for each canonical translation, at multiple scales, and 
then marginalize or eliminate occlusions by a max-out procedure at decision time, that 
now corresponds to a (combinatorial) search among all possible canonical locations 
in putatively corresponding images. The (multiple) canonization procedure provides 
multiple similarity frames, one per each canonical translation Ti, and at that translation 
for each sample scale aj , and a canonized rotation Rij : 

9ij = {Ti,(jj,Rij,mij} (3.19) 

where we have indicated with a canonized contrast transformation, although con- 
trast can equivalently be eliminated by considering the level lines or the gradient direc- 
tion as described in Section 6.1. 

The image, or a contrast-invariant, relative to the reference frame identified by c/ij 
is then, by construction, invariant modulo a selection process, in the sense that the 
corresponding frame may or may not be present in another datum (e.g. a test image) 
depending on whether it is co-visible. Thus, we have a collection of local invariant 

LOCAL INVARIANT FEATURES features of the form 

<M') = IiRjjiS^x -TO), (3.20) 

z, j = 1, . . . 7V T , N s \B aj (x + Ti) n ft = 

where S : R 2 —> M 2 ; x \-> S(x) = ax, with a > is a scale transformation, with 
S~ 1 (x) — x/a, T G R 2 is a planar translation, and R G SO(2) is a planar rotation 
[125]. Here Nt and Ns are the number of canonical locations and sample scales 
respectively. Note that the selection of occluded regions, which is excluded from the 
descriptor, is not know a-priori and will have to be determined at decision time as part 
of the matching process. In particular, given one image (e.g. the training image) Ii and 
its features faj (Ii), and another image (e.g. the test image) I2, and its features <j)im{h)^ 
marginalizing the nuisance amounts to testing whether <j)ij(Ii) — (j)im(h), for all 
possible combinations of i, j, /, m, corresponding to the hypothesis that the canonical 

11 If contrast invariance is desired, the image / can be replaced by the gradient direction nwFM • 
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region around the canonical translation Ti at scale aj in image I\ is co-visible with the 

region around the canonical translation T\ at scale (7 in image 1 2 . MARGINALIZING occlusion via combinatorial selec- 

In this sense, we say that translation is locally canonizable: The description of 
an image around each U at each scale Cj can be made invariant to translation, unless 
the region of size Cj around U intersects the occlusion domain ft C D, which is a 
binary choice that can only be made at decision time, not by pre-processing test images 
independently. 

This is particularly natural when translation is canonized using a partition of the im- 
age into e-constant (or stationary) statistics, such as normalized intensity, color spectral 
ratios, or normalized gradient direction histograms. In fact, each region (a node in the 
adjacency graph) or each junction (a face in the adjacency graph) can be used to canon- 
ize translation, and the image can be described by the statistics of adjacent neighbors, 
as we do in Section 4.3. 

So, although translation is not globally canonizable, we will refer to its treatment 
as local canonization modulo a selection process to detect whether the region around 
the canonical representative is subject to occlusion. What we have in the end, for 
each image I, is a set of multiple descriptors (or templates), one per each canonical 
translation, and for each translation multiple scales, canonized with respect to rotation 
and contrast, but still dependent on deformations, complex illumination and occlusions. 
One can also think of local canonization as a form of adaptive sampling, where regions 
that are not covariant are excluded. The same reasoning can also be applied to scale, 
where one can think of (multiple) scale selection as adaptive sampling of scale. 

In Section 4 we will show how to construct local co- variant detectors, and in Sec- 
tion 6 how to exploit this process to construct local invariant descriptors. 

Remark 7 (When to avoid canonization) As we have done already, we would like to 
remind the reader that canonization is just a way to factor out the simple nuisances. 
Because of occlusions, this canonization process reduces to feature selection (in each 
individual image) and combinatorial matching. This process represents an approxima- 
tion of the "correct" marginalization or max-out procedure described in Section 2.3. 
Therefore, if computational resources allow, the best course of action is to max-out or 
marginalize the nuisance, i.e. to avoid the canonization process altogether. This choice 
is usually dictated by the application constraints, and should be made with knowledge 
of the tradeoffs involved: Canonization reduces the complexity of the classifier, but at 
a cost in discriminative power. Marginalization is the best option if one has a prior 
available. Otherwise, max-out is the best option if the classifier can be computed in 
time useful for the task at hand. 

3.4.1 Clarifying the formal notation log 

So far we have used the notation / o g to indicate that, if / = fo(e, £, v), then I o g — 
h(g,t;, v). This may seem inconsistent, as the scene and the image live in different 
spaces, and therefore g acting on £ is not the same as g acting on /. We are now ready 
to explain this inconsistency. 

For a nuisance g to be eliminated in pre-processing without a loss, it has to be a 
group, as well as commute with non-group nuisances. After the discussion in the previ- 
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ous sections, this can be expressed as h(g, £, v) = gh(e, £, z/) for all z/. Therefore, if we 
divide the group nuisance g into an "invertible" component g % and a "non-invertible" 
component g n , so that I = h(g z g n ^ £, i/), we then have that 



Hg i g n M=9 i Hg n ,£,v). 



(3.21) 



But because g n is not invertible, and therefore it cannot be eliminated in pre-processing 
without a loss, its group structure is of no use, and we might as well lump g n with the 
non-invertible nuisance v. Therefore, from this point on, we can assume as a formal 
model of image formation the following: I t = gth(e, £, v t ) + n t , and for simplicity we 
can omit the identity e. Therefore, we can represent this equivalently a / = gh(£, v) 
or I = h(g,£,v). Now it should be clear that the (invertible) group g can act on the 
image /, so the notation / o g = gl is justified. 

For instance, in the Lambert- Ambient model, I(x) = p(p), contrast normaliza- 
tion is commutative: A contrast transformation applied to the albedo k(p(p)) induces a 
contrast transformation on the image, and therefore it can be neutralized ("inverted") by 
performing co-variant detection on the image. So, if I(x) = fc(p(p)), then k~ 1 (I(x)) = 
k~ 1 k(p(p)), where k = k(I), is invariant to contrast. Similarly, a planar rotation ap- 
plied to the scene, p(Rp), where 



R 



cos — sin 6 
sin 6 cos 
1 



induces a rotation on the image plane, I(rp) where 



cost 
sin^ 



— sin( 
cos# 



so I(rx) = p(Rp). A choice of orientation on the image plane, corresponding to a 
canonical rotation r = f(J) can then be used to canonize R. Calling 



R 







we have that I(r~ 1 x) = I(r T x) = p(R T Rp) is invariant to a planar rotation R. The 
story is considerably different for translation, and for general rigid motions, including 
out-of-plane rotation. 

A translation along the optical axis, T = [0, 0, s] T induces a transformation on the 
image plane that depends on the shape of the scene S. If the scene is planar and fronto- 
parallel, translation along the optical axis induces a re-scaling of the image plane, also 
a group: I(sx) = p(p + T). However, if the scene is not planar and fronto-parallel, 
forward translation can generate very complex deformations of the image domain, in- 
cluding (self-) occlusions. Similarly, for translation along a direction parallel to the 
optical axis, T = [u, v, 0] T , only if the scene is planar and fronto-parallel does this 
result in a planar translation, so that I(x + [u, v] T ) = p(p + T). 
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It should be mentioned that different data modalities yield different commutativity 
properties, so it is possible that for certain data types the entire group nuisances may 
be commutative. For instance, cast shadows are non-invertible nuisances in grayscale 
images. However, if one has multiple spectral bands available, then several contrast- 
invariant features, such as spectral ratios, can be employed to eliminate the cast shad- 
ows. 



3.4.2 Local canonization of general rigid motion 

As we have discussed, parallax motion, including translation and out-of-plane rota- 
tion, does not commute with rotation, therefore one can at best locally canonize it. 
If one wishes to eliminate the effects of an arbitrary rigid motion, the entire group 
of diffeomorphisms has to be canonized; as shown in [ ], this comes at a loss of 
discriminative power, since scenes of different shape are lumped onto the same equiv- 
alence class. But since occlusions force locality, one can also approximate general 
diffeomorphisms locally, and therefore only canonize sub-groups of planar diffeomor- 
phisms. There is then a tradeoff between the sub-group being canonized and the size 
of the domain where the canonization is valid. Given an arbitrary margin, there will 
be a domain where the deformation is approximated by a translation, a larger domain 
where it is approximated by an isometry, a larger domain yet where it is a similarity, an 
affine transformation, and a projective transformation. 

To see that, consider a general rigid motion in space (R,T). That induces a defor- 
mation of the image domain that can be described, in the co-visible regions, by equation 
(2.9). In more explicit form, if we represent shape S as the graph of a function in one 
of the images, say at time t = 0, so that p = p(xq) = xqZ(xq), with x £ then 
under the assumptions of the Lambert- Ambient model, It(x t ) = p(p) = Io(xo) where 

x t = w(x ) = 7r(Rx Z(x ) + T) = J _ — " (3.22) 

R(3,:)X Z(Xo) +T 3 

in the co-visible region D D w(D). Here we have used Matlab notation, where R(v.2,:) 
means the first two rows of R. 

Of course, the size of the region where the approximation is valid cannot be de- 
termined a-priori, and should instead be estimated as part of the correspondence hy- 
pothesis testing process [ ]. However, it is common practice to select the size of the 
regions adaptively as a function of the scale of the translation-covariant detector. This 
will be further elaborated in Chapter 5. 

In the next section we establish the relation between local canonization and texture. 



3.5 Textures 



This section establishes the link between the notion of canonization, described in the 
previous section, and the design of invariant features, tackled in the next section. 
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3.5.1 Defining "textures" 

A "texture" is a region of an image that exhibits some kind of spatial regularity. Figure 
3.2 shows some examples of what are often called regular textures. They share the 
spatial regularity of some elementary structure, or "texture element", or "texton" [97]. 
The images in Figure 3.3, on the other hand, do not exhibit regular repetition of any 
such structure. Instead, they are characterized by the fact that some ensemble property 
of the image is spatially homogeneous, i.e. , some statistic is translation-invariant. 
Such ensemble properties are pooled in a region whose minimal size plays the role 
of the elementary texture element. Translation invariance of statistics is captured by 
the notion of stationarity [165]. For instance, a wide-sense stationary process has 
translation-invariant mean and correlation function. A strict-sense stationary process 
has translation-invariant distribution. The concept of stationarity can be generalized 




Figure 3.2: Regular textures 

from simple translation invariance to invariance relative to some group: In Figure 3.5 
(top) the images do not have homogeneous statistics; however, in most cases one can 
apply an invertible transformation to the images that yield a spatially homogeneous 
statistics. 

The following characterization of texture is adapted from [63], where the basic 
definitions of stationarity, ergodicity, and Markov sufficient statistics are described in 
some detail. The important issue for us is that, if a process is stationary and ergodic, 
a statistic <j) can be predicted from a realization of the image in some region O. Once 
established that a process is stationary, hence spatially predictable, we can inquire on 
the existence of a statistic that is sufficient to perform the prediction. This is captured 
by the concept of Markovianity: 

I(x)±I u \</> u \ x . (3.23) 

Equation (3.23) establishes (j)^ as a Markov sufficient statistic. In general, there will 
be many regions uj that satisfy this condition; the one with the smallest area |cj| = r, 
determines a minimal sufficient statistic and the statistic defined on it is the elementary 
texture element. From now on, we will refer to (j)^ as the minimal Markov sufficient 
statistic. 
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Figure 3.3: Stochastic textures 



We can then define a texture as a region of an image that can be rectified into a 
sample of a stochastic process of a planar lattice that is locally stationary, ergodic and 
Markovian. More precisely, assuming for simplicity the trivial (translation) group, a 
region ft C D C M 2 of an image is a texture at scale a > if there exist regions 
uo C Co C ft such that I is a realization of a stationary, ergodic, Markovian process 
locally within ft, with 1^ sl Markov sufficient statistic and a = \Co\ the stationarity 
scale. We then have that 

H(I(x)\u - {x}) = H(I(x)\fL - {x}). (3.24) 

We can therefore seek for uo c ft that satisfies the above condition. Without a com- 
plexity constraint, there are many regions that satisfy the above condition. To find u, 
we therefore seek for the smallest one, by solving 

lu = dLigmmH(I(x)\uj — {x}) + \ (3.25) 

' p 

Note that this is a consequence of the Markovian assumption and the Markov sufficient 
statistic satisfies the Information Bottleneck principle [183] with f3 — ^ oo. As a special 
case, we can choose uj to belong to a parametric class of functions, for instance square 
neighborhoods of x, excluding x itself, of a certain size a, B a (x), so the optimization 
above is only with respect to the positive scalar a. The tradeoff will naturally settle for 
1 < r < a. Therefore, we can simultaneously infer both a and r by minimizing the 
sample version of (3.25) with a complexity cost on a = \uj\: 

o),<T = arg min H(I{x)\uj — {x}) + — \uj\. (3.26) 

oj,a=\oj\ p 

Note that both uo and Co are necessary for extrapolation: uo defines the Markov neigh- 
borhood used for comparing samples, and Co defines the region where such samples are 
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sought to approximate the probability distribution p(I(x) \uj — {x}). 

Eq. (3.26) provides means to infer both the Markov sufficient statistic as well as the 
scale a = \uj\ of a texture, assuming that the stationarity and ergodicity assumptions are 
satisfied. Testing for stationarity (ergodicity must be assumed and cannot be validated) 
amounts to inferring Q, a texture segmentation problem. Estimating the group element 
g G G amounts to a canonization process. 

Given {I(x),x G Q}, compression is achieved by inferring the (approximate) min- 
imal sufficient statistic uj and the stationarity scale a by solving (3.26). Then for 
any uj C Q with \uo\ = a is stored. 

Given a compressed representation I(uo), we can in principle synthesize novel in- 
stances of the texture by sampling from dPf^I^) within uj. In a non-parametric setting 
this is done by sampling directly neighborhoods I(uj) within uj. To extrapolate the 
texture from a given sample I(uj) compatibility conditions have to be ensured at the 
boundaries of uj. 

To find Q we can solve 

n = arg min H(I(x G duo) \uo) + ^- (3.27) 

oo \Uj\ 

where the entropy is normalized by the length of the boundary \duu\ (entropy rate). 
Since we do not know the distribution outside O, the above is approximated by 

ft = arg min H(I(x G duo)\uoJ(x G Suj - {x})) + (3.28) 

OJ \UJ\ 

In the continuum, the problem above can be solved using variational optimization as 
in [178]. In the discrete, the reader can refer to [63] for a description of compression, 
synthesis, segmentation and rectification algorithms. 

3.5.2 Textures and Structures 

In this section we establish the relation between textures, defined in the previous sec- 
tion, and structures, defined by co-variant detectors. Consider a point x G D and its 
neighborhood. If it is canonizable at a scale e, there is a co- variant detector with support 
e (a statistic) that has an isolated extremum. This implies that the underlying process is 
not stationary at the scale \u)\ — e. Therefore, it is not a texture. It also implies that any 
region uj of size e = |cj| is not sufficient to predict the image outside that region. This 
of course does not prevent a region that is canonizable at e to be a texture at a scale 
a ^ e. Within a region a there may be multiple frames of size e, spatially distributed 
in a way that is stationary/Markovian. Vice- versa, if a region of an image is a texture 
with a = uj, it cannot have a unique (isolated) extremum within uj, lest it would not 
be a sample of a stationary process. Of course, it could have multiple extrema, each 
isolated within a region of size e « a. The above argument is a sketch of a proof of 
the following: 

Theorem 4 (Structure-Texture) For any given scale of observation a, a region uj with 
\uj\ = a is either a structure or a texture. 
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Hence one can detect textures for each scale, as the residual of the canonization pro- 
cess described in the earlier part of this chapter. One may have to impose boundary 
conditions so that the texture regions fill around structure regions seamlessly. This can 
be accomplished by an explicit generative model that enforces boundary conditions 
and matches marginal statistics in the texture regions, following [ ], as illustrated by 
[217]. However, this has to be done across all scales, unlike [217], since whether a 
region is classified as texture or structure depends critically on the scale a. 

Since the decision of what is a structure depends on time (or, more in general, on 
the presence of multiple images), consequently, what is a texture is also a decision 
that requires multiple images. For instance, the random dot stereogram is everywhere 
canonizable and properly sampled at some scale a, since one can establish correspon- 
dence. It is, however, a texture at all coarser scales, where any statistic is translation 
invariant. 




Figure 3.4: Testing the definition of texture. Both images above satisfy the defini- 
tion of texture. Images like the one on the left, however, are often called "textureless." 
Therefore, textureless is a limiting case of a trivial texture where any statistic <p, com- 
puted in any region uo, is constant with respect to any group G. Images like the one 
on the right are called random-dot displays, and Julesz showed that one can establish 
binocular correspondence between such displays (see Figure 5.2). 

Remark 8 ("Stripe" textures) The reader versed in low-level vision will notice that 

edges are not canonizable with respect to the groups we have discussed so far, including 

the translation group. This should come at no surprise, as we have anticipated in 

Remark 4, because of the aperture problem: While one can fix translation along the APERTURE PROBLEM 

direction normal to the edge (at a given scale), one cannot fix translation along the 

edge (Figure 3.7). However, if one considers the sub-group of translations normal to 

the edge, the corresponding region is indeed canonizable. 

The "random-dot display" of Figure 3.4 (right) seems to provide a counter-example 
to Theorem 4. In fact, Julesz [ ] showed that random-dot displays can be successfully 
fused binocularly, which means that pixel-to-pixel correspondence can be established, 
which means the the image is canonizable at every pixel which, by Theorem 4, means 
that it is not a texture. But the random-dot display satisfies the definition of texture 
given in the previous section. How can this be explained? 
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Figure 3.5: Interplay of texture and scale. Whether something is a texture depends 
on scale, and "structures" can appear and disappear across scales. Such "transitions" 
can be established in an ad-hoc manner using complexity measures as in Figure 3.6, or 
they can be established unequivocally by analyzing multiple views of the same scene 
through the notion of proper sampling, introduced in Section 5.1. 

First, there is no contradiction, since random-dot displays are indeed canonizable 
at the scale of the pixels a = 1. However, they are stationary at any coarser scale, 
hence they are textures at those scales. 

More importantly, however, whether two regions of an image can be matched at 
a given scale depends not only on whether they are canonizable, but also whether 
they are properly sampled at that scale. We have anticipated the concept of proper 
sampling in Remark 6, and we will introduce and discuss the notion of proper sampling 
in Section 5.2, where we will revisit the random-dot display. There, we will see that 
the critical scales within which regions can be successfully matched (hence not called 
"textures" per Theorem 4) cannot be decided by analyzing one image alone. Instead, 
the "thresholds" for the transition from "matchable" to "non-matchable" require having 
multiple images of the same scene available. 
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Figure 3.6: Texture-structure transitions (from [23]). The left plot shows the entropy 
profile for each of the regions marked with yellow boundaries in the middle and right 
figure. The abscissa is the scale of the region where such entropy is computed, starting 
from a point and ending with the entire image. As it can be seen, the entropy profile 
exhibits a staircase-like behavior, with each plateau bounded below by the scale cor- 
responding to the small uo, and above by the boundary ofB. This also shows that the 
neighborhood of a given point can be interpreted as texture at some scales (all the scales 
corresponding to flat plateaus), structure at some other scale (the non-flat transitions), 
then texture again, then structure again etc. 




Figure 3.7: The aperture problem: Locally, in a ball of radius cr, B a , motion can only 
be determined in the direction of the non-zero gradient of the image. The component 
of motion parallel to an edge cannot be determined. This causes several visual illu- 
sions, including the so-called "barber pole" effect where a rotating cylinder appears to 
translate upward. 
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Chapter 4 

Designing feature detectors 



Section 3.4 showed that rotation and contrast can be canonized, and that translation and 
scale can only be locally canonized. The definition of a canonized feature requires the 
choice of a functional ip (a co- variant detector), and an invariant statistic <p (a feature 
descriptor). Here we will focus on the location- scale group g = {x, a} C M? x R + , 
assuming that rotation and contrast have been canonized separately (later we advocate 
using gravity as a global canonization reference for orientation). The goal, then, is to 
design functionals ip that satisfy the requirements in Definition 3 that they yield isolated 
critical points in G. At the same time, however, we want them to be "unaffected" as 
much as possible by other nuisances. 1 Such "insensitivity" is captured by the notions 
of commutativity, stability, which we introduce next, and proper sampling, that we 
introduce in the next chapter. 



4.1 Sensitivity of feature detectors 

We consider two qualitatively different measures of sensitivity. Note that we use the 
(improper) term "stability" because it is most commonly used in the literature, even 
though there is no equilibrium involved, and therefore the proper system-theoretic 
nomenclature would really be sensitivity. We will start from the most common notion 
of bounded-input bounded-output stability. 

Definition? (BIBO stability) A G-covariant detector ip (Definition 3) is bounded- 
input bounded-ouput (BIBO) stable if small perturbations in the nuisance cause small 
perturbations in the canonical element. More precisely, V e > 3 S = 5(e) such that 
for any perturbation Sv with \\Sv\\ < S we have \\Sg\\ < e. 

Note that g is defined implicitly by the equation ^(1, g(I)) = 0, and a nuisance per- 
turbation 5v causes an image perturbation SI = ^Sv. Therefore, we have from the 



So, we want ^-stable G-covariant detectors. 
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Inverse Function theorem 2 [73] 

dh 

5g = -\Jg\- 1 —5v = K5v (4.1) 

where J g is the Jacobian (3.7) and K is called the BIBO gain. As a consequence of 
the definition, K < oo is finite. The BIBO gain can be interpreted as the sensitivity 
of a detector with respect to a nuisance. Most existing feature detector approaches are 
BIBO stable with respect to simple nuisances. Indeed, we have the following 

Theorem 5 (Covariant detectors are BIBO stable) Any covariant detector is BIBO- 
stable with respect to noise and quantization. 

Proof: Noise and quantization are additive, so we have = 8v, and the gain is just 
the inverse of the Jacobian determinant, K = \Jg\~ 1 . Per the definition of co-variant 
detector, the Jacobian determinant is non-zero, so the gain is finite. 
BIBO stability is reassuring, and it would seem that a near-zero gain is desirable, be- 
cause it is "maximally (BIBO)-stable." However, simple inspection of (4.1) shows that 
K = is not possible without knowledge of the "true signal." In particular, this is the 
case for quantization, when the operator ip must include spatial averaging with respect 
to a shift-invariance kernel (low-pass, or anti-aliasing, filter). However, a non-zero 
BIBO gain is irrelevant for recognition, because it corresponds to an additive perturba- 
tion of the domain deformation (domain diffeomorphisms are a vector space), which is 
a nuisance to begin with (corresponding to changes of viewpoint [ ]). On the other 
hand, structural instabilities are the plague of feature detectors. When the Jacobian is 
singular, | Jy^ | —> 0, we have a degenerate critical point, a catastrophic scenario [152] 
STRUCTURAL STABILITY MARGIN whereby a feature detector returns the wrong canonical frame. 

Definition 8 (Structural Stability) A G-covariant detector ijj \ ip(I,g(I)) = is 
Structurally Stable if small perturbations 5v preserve the rank of the Jacobian matrix: 

35>0\\Jf)\^0=>\J §+ s § \^0 V5v\\\5v\\<5 (4.2) 
with 5g given from (4.1). 

In other words, a detector is structurally stable if small perturbations do not cause sin- 
gularities in canonization. We define the maximum norm of the nuisance that does not 
cause a catastrophic change [152] in the detection mechanism the structural stability 
margin. This can serve as a score to rank features. 

Definition 9 (Structural Stability Margin) We call the largest S that satisfies equa- 
tion (4.2) the structural stability margin: 

5*=sup||HI I \J§+ksu\^0 (4-3) 

2 One has to exercise some care in defining the proper (Frechet) derivatives depending on the function 
space where ip is defined. The implicit function theorem can be applied to infinite-dimensional spaces so 
long as they have the structure of a Banach space (Theorem A.58, page 246 of [105]). Images can be 
approximated arbitrarily well in L 1 (M 2 ), that is Banach. 
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Example 5 (Structural stability for the translation-scale group) 

Consider the set of images, approximated by a sum of Gaussians as described in Sec- 
tion B.2.1. The image is then represented by the centers of the Gaussians, Hi, their 
variance a\ and the amplitudes so that I(x) = ctiQ{x — fi^af). Consider a 
detection mechanism that finds the extrema g = {£, a} of the image convolved with a 
Gaussian centered at x, with standard deviation a: ip(I, g) = I * VQ(x — x\ a 2 ) = 0. 
Among all extrema, consider the two x\,x<i that are closest. Without loss of generality, 
modulo a re-ordering of the indices, let \i\ and \±2 be the "true" extrema of the origi- 
nal image. In general x\ ^ \i\ and x 2 ^ ji2- Let the distance between \i\ and \±2 be 
d = |/i 2 — \i\ \, and the distance between the detected extrema be d — \x 2 — x\ \. Trans- 
lation nuisances along the image plane do not alter the structural properties of the 
detector (d does not change). However, translation nuisance orthogonal to the image 
plane do. These can be represented by the scaling group a, and in general d = d(a) is 
a function of a that starts at d = d when a = and becomes d — when a = a*, i.e. 
when the two extrema merge in the scale-space. In this case, = <r* is the structural 
stability margin. It can be computed analytically for simple cases of Gaussian sums, 
or it can be visualized as customary in the scale-space literature. It is the maximum 
perturbation that can be applied to a nuisance that does not produce bifurcations in 
the detection mechanism (Figure 4.1). Note that one could also compute the struc- 
tural stability margin using Morse's Lemma, or the statistics of the detector (e.g. the 
second-moment matrix). Finally, the literature on Persistent Topology [56, 42, 38] also 
provides methods to quantify the life-span of structures, which can be used as a proxy 
of the structural stability margin. Indeed, the notion of structural stability proposed 
above is a special case of persistent topology. 

A sound feature detector is one that identifies Morse critical points in G that are as far 

as possible from singularities. Structural instabilities correspond to aliasing errors, or 

improper sampling (Section 5.1), where spurious extrema in the detector ip arise that PROPER SAMPLING 

do not correspond to extrema in the underlying signal. Proper sampling depends on 

the detector functional ip, that in the presence of quantization depends on the scale a 

(the area of the support of the quantization kernel). Thus the ideal detector is one that 

chooses g that is as far as possible from singularities in the locus {g \\7ip(I, g) = 0}. 

The selection of the best canonical frames according to this principle is described in 

Section 5.2. 

Note that a canonical frame g is often called a 'feature point" or (< keypoint" 
(or "corner"), an inappropriate nomenclature unless G is restricted to the translation 
group. Note also that one should not confuse a (canonical reference) frame g from a 
(video) frame, which is an image I t that is part of a sequence {I t }J =1 obtained sequen- 
tially in time. Which "frame" we are referring to should be clear from the context. 

4.2 Maximum stability and linear detectors 

In this section, as a way of example, we introduce detectors designed to be maximally 
BIBO stable. In other words, we look for classes of functionals ip that yield maximally 
isolated critical points. This seems to be an intuitive appealing criterion to follow. 
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Figure 4.1: Catastrophic approximation: The original data is the sum of two Gaussians, 
one centered at [i\ = (left, red dotted line), the other at a distance /12 = d (left, blue 
dotted line) that increases from to 5 (green line). The translation-detector x initially 
detects one extremum whose position (blue) deviates from fi 2 as d increases. However, 
at a certain point the detector x splits into two, x 2 that follows \i 2 , and x\ that converges 
towards fii = 0. Note that at no point does the translation detector x coincide with the 
actual modes /i (i.e. , this detector is not BIBO insensitive). Furthermore, only for 
sufficiently separated, or sufficiently close, extrema does the location of the detector 
approximate the location of the actual extrema. 



However, we will see in Section 5.1 that this does not guarantee structural stability. 
Nevertheless, we describe this approach because it is one of the most common in the 
literature, and it is better than just testing for the transversality condition (3.7), which 
is fragile in the sense that, in the presence of noise, it will almost always be satisfied. 
In Section 5.1 we will introduce an approach to feature detection that is better suited to 
the notion of structural stability. Therefore, the reader can skip this section unless he 
or she is interested in understanding the relation between the approach proposed there 
and the ones used in the current literature. 

To design a maximally BIBO-stable detector, we look for functionals that yield critical points where the 
Jacobian determinant is not just non-zero, but it is largest. Following the most common approaches in the 
literature, we try to capture this notion using differential operators. 3 In this case, we look for points that are 
at the same time zeros of Vip = ^, and also critical points of | V^|. Now, in general, for a given class ^, 
there is no guarantee that the first set (zeros of ^) intersects the second (zeros of V| V^|). However, we can 
look for classes of functions where this is the case, for all possible images /. These will be functionals that 
satisfy the following partial differential equation (PDE): 

^ = V|W|. (4.4) 

Note that designing feature detectors by finding extrema of functionals requires continuity, which is some- 
thing digital images do not possess. However, instead of finding extrema of the (discontinuous) image one 
can find extrema of operators, exploiting a duality as customary [43, 119, 180] especially for translation- 
invariance. Note that the "shift-plus-average" in [43] corresponds to template blurring, a lossy process that 
is not equivalent to proper marginalization or extremization. 
LINEAR DETECTORS We can start our quest for such functionals among linear 4 ones, i.e. of the form ^(/, {x, a}) = 

Q * I(x,a) = [0, 0, 0] T , where * is a convolution product. In particular, Q can be thought of as the gradient 

3 An alternative approach not requiring differentiability is presented in Section 4.3. 
4 The advantage of linear functionals is that they commute with all additive nuisance operations, such as 
quantization and noise. 
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of a scalar kernel k : R 2 x R 3 — > R + ; (x, {y, a}) i->- fc(cc — y;a), acting on the image via a convolution 
/c * a) = f k(x — y\ a)I(y)dy, so the zeros of Q * / are the critical points of k * /, so Q = X7k T : 

g = {x,a}\ j(Vk T (x-y;a)I(y)dy = Q(x,a)*I = 0. (4.5) 

Written in terms of Q, the PDE above becomes an integro-partial differential equation (I-PDE) 

g * i = v\vg * i\ (4.6) 

a condition that should be satisfied for all / £l. Note that the condition is only required at the zero-level set, 
so technically speaking a solution of the entire (4.6) is not necessary. We can approximate a generic function 
/ with a linear combination of (basis) vectors {bi(x)}, I(x) = B(x)a, where B(x) = [bi(x), . . . , &jv(#)] 
are, for instance, a complete orthonormal basis, then the condition above has to be satisfied for all functions 
bi(x). Of the many choices of basis, we favor one that follows a result of Wiener [203], that states that 
any positive L 1 distribution can be approximated arbitrarily well with a positive combination of Gaussian 
kernels. So, bi(x) = Af(x — xi\ai) are Gaussian kernels centered at Xi with standard deviation cr^. 
Therefore, the condition above becomes 

G*N{x,(j) = V\VG*J\f(x,a)\ VO,a). (4.7) 

All the gradient operators can thus be transferred to the Gaussian kernel by linearity, and derivatives of 
Gaussian are products of Gaussians with Hermite polynomials, that form a complete orthonormal family in 
L 2 . Using Jacobi's formula for the gradient of the determinant of a function with respect to its elements, we 
get 

Q * Mix, a) = trace (adj(£ * VAT)£ * D 2 M) (4.8) 

where D 2 denotes the Hessian matrix and adj is the adjugate matrix of co-factors (each element i, j is the 
determinant of the minor obtained by removing row i and column j). This is another way of looking at (4.4) 
and (4.6), but now as an ordinary (non-linear) functional equation in the unknown Q. 

The equation (4.4) is related to Monge- Ampere's equation in optimal transport; an approximation of the 

determinant of the Hessian has been used for a Gaussian Kernel in approximate form (using Haar wavelet 

bases and the Integral Image) in the SURF detector [17]. 



MONGE-AMPERE EQUATION 
HARRIS' CORNER 



Example 6 (Harris' corner revisited) The discussion above legitimizes the use ofHessian- 
of-Gaussian operators for feature detection, for instance [ ]. The Laplacian-of- 
Gaussian operator, used in the popular SIFT [ ], can also be partially justified 
on the grounds of first-order approximation of the Hessian, as customary in Newton 
methods. The other popularly used detector, Harris ' corner detector, however, is not 
explained by the derivation above. In fact, note that Harris' operator 

H(I,0) = | J V T IV(I)dx\ - ATrace (^J V T IV(I)d^j (4.9) 

is not a linear functional of the image. It is still possible, however, to define the canon- 
izing operator 

1 p(I,g)=H(I,g) (4.10) 

and verify whether it yields isolated critical points. This is laborious and not relevant 
in the context of our discussion. We only mention that, because H is non-linear, it 
does not commute with quantization, and therefore it does not meet the conditions 
of a proper co-variant detector (it is not commutative). Instead, below we suggest a 
different procedure to detect corners (or, more in general, junctions) that also provides 
a canonization procedure. 



62 



CHAPTER 4. DESIGNING FEATURE DETECTORS 




Figure 4.2: Representational structures: Superpixel tree (top), dimension-two struc- 
tures (color/texture regions), dimension-one structures (edges, ridges), dimension- zero 
structures (Harris junctions, Difference-of -Gaussian blobs). Structures are computed at 
all scales, and a representative subset of (multiple) scales are selected based on the lo- 
cal extrema of their respective detector operators (scale is color-coded in the top figure, 
red=coarse, blue=fine). Only a fraction of the structures detected are visualized, for 
clarity purposes. All structures are supported on the Representational Graph, described 
in the next figure. 

4.3 Non-linear detectors and the segmentation tree 

Linear functionals are not the only feature detectors. Indeed, Theorem 4 establishes 
the link between feature detection and (generalized) texture segmentation. Therefore, 
rather than testing for canonizability (as done customarily in feature detection) one 
can test for stationarity (as done customarily in segmentation) and then construct fea- 
tures from the segmentation tree. The caveat is that, because of the interplay of the 
scale group with quantization, and of the translation group with occlusion, no single 
segmentation can be used as a viable canonization procedure, and instead the entire 
segmentation tree must be considered. 

The starting point for this approach to canonization is a different approximation model of the image. 
Rather than a linear combination of globally-defined basis vectors such as Gaussians or sinusoids, we use 
simple functions. These are constant functions on a compact domain, whose universal approximation prop- 
erties in several measures are guaranteed by Weierstrass' theorem [160]. In particular, let a > be a given 
"scale." Then, given an image, we can find a partition of the domain D and constant values such that a 
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Figure 4.3: Representational Graph (detail, top-left) Texture Adjacency Graph (TAG, 
top-right); nodes encode (two-dimensional) region statistics (vector-quantized filter- 
response histograms), pairs of nodes, represented by graph edges, encode the likeli- 
hood computed by a multi-scale (one-dimensional) edge/ridge detector between two 
regions; pairs of edges and their closure (graph faces) represent (zero-dimensional) 
attributed points (junctions, blobs). For visualization purposes, the nodes are located 
at the centroid of the regions, and as a result the attributed point corresponding to a 
face may actually lie outside the face as visualized in the figure. This bears no conse- 
quence, as geometric information such as the location of point features is discounted 
in a viewpoint-invariant statistic. 



combination of simple functions approximates the original image (or any statistic computed on it) to within 
cr in each region (all filter channels can then be combined into a vector description of a region). Specifically, 
for a given / El and cr, we assume there is iV and constants {ai , . . . , aj\r} and a partition of the domain 
{Si, . . . , Sjy} such that 



\I(x) -atj\ < aVx G Sj\ SiDSj 



D, 



(4.11) 



where 5ij is Kronecker's Delta. Then, if we define the simple functions as characteristic functions of Sj 



we can approximate the image with 



XSi 0) 



I(x) 



l,VxG Sj 
otherwise 



^2xS 3 {x)QLj. 
.7=1 



(4.12) 



(4.13) 



This is true for scalar- valued images, but a similar construction can be followed to partition the domain into 
regions (often called superpixels), based on cr-constancy of any other statistic, such as color or any other superpixels 
higher-dimensional feature </>. In any case, a superpixelization algorithm can be thought of as a quantizer, 
that is an operator that takes an image I and a parameter a and returns a family of domain partitions quantizer 
N = N(a), {Sj}^ =1 with 



{^}f=i 



I- f I{x)dx 

-)4\ J 



\Sj\ 



(4.14) 
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where \Sj | is the area of Sj . We can now use this functional to determine a co-variant detector g). For 
the case of translation, g = T, we define (multiple) canonical elements Tj to be the centroids of the regions 
Sj, Tj = j^-j f s xdx. Making the dependency on the image explicit, we have 

f(I,*)= I +~« )X * C (4.15) 

to which there corresponds the canonizing functional 

{T, a}) = [ xdx-T f dx. (4.16) 

Ucr(I) 

It can be easily verified that this functional is co-variant. For the case of translation (fixing a), g = T, we 
have that, for g that solves ip(I, g) = 0, and for any g 

ip{I o g, g o g) = / xdx — gg dx = xdx — gg dx = 

J<j>(Iog) J<t>{Iog) JgW) J<KI) 



= / gx dx — gg dx = g I xdx' — g / dx 

J 4(1) J ^(i) \U(i) J 4(i) J 

= grl>(I,g) = (4.17) 

where we have used the fact that the group g is isometric, so dx = dx' . 

One may also believe that this functional yields isolated extrema, based on the fact that 

|VVl = |§Jl= f dx>0. (4.18) 

However, this result, as well as (4.17), is misleading because it assumes that the superpixelization <j>a{I) 
is independent of (small variations in) T. More precisely, in order for translation to be canonizable, the 
canonization process has to commute with quantization. If we assume that the underlying "ideal image" 
I(x), x E D is continuous, then the "discrete" (quantized) image I(xi) = f B ^ ^ I(x)dx/e,Xi G A 
defined on the lattice A, is related to it via the mean- value theorem, that guarantees the existence, for each 
x i, of a translation 5i such that 

I(xi) = -J- f I{x)dx = I( Xi + Si) = I(xi) +ni = Iou (4.19) 

\Be\ JB e (xi) 

where v denotes the quantization nuisance. For the canonization process to be viable we must have 

4> a (I) = <\> a (Iov) (4.20) 

which is clearly not the case in general for a superpixelization algorithm. In fact, if we apply small pertur- 
bations to the levels ni, from (4.19) we get small perturbations in the location of the boundaries dSj, and in 
the location of their centroid, Tj — > Tj + Si. Since ni = VI(xi)Si, we see that for a superpixelization pro- 
cedure 4>a- to provide a viable canonization mechanism, it has to place the boundaries in such a way that ^ 
is negligible within each region, and as large as possible at region boundaries. We will see this spelled out 
more in detail shortly. The sensitivity of the boundary location as a function of a perturbation can be phrased 
in terms of BIBO stability, introduced in Definition 7. Specifically, if (j) a is a quantization/superpixelization 
operator acting on an e-quantized image (4.19), and {Sj}^ =1 = <t>a(I) and {Sj}^-^ = (f) a (I + n) the 
corresponding partitions, then (f) a is BIBO stable if, according to (4.1), 

||n|| < a \§j -Sj\<e (4.21) 

where \ S\ — 5*2 1 denotes the area of the set-symmetric difference of the two sets Si and S2. 

Note that, in general, the operator is not continuous, since it depends on N. When TV is fixed, it 
is possible to design a procedure that implements an operator 4> a that is guaranteed to be stable even if the 
image is not continuous (but piece- wise differentiable). For instance, in a variational multi-phase region- 
based segmentation approach, one has an energy functional E(I) that is continuous and minimized with 
respect to the (infinite-dimensional) partition {Sj}, so in this case we have that 4> a is defined implicitly by 
the first-order optimality conditions (Euler-Lagrange) SE(I) = 0, yielding a partial differential equation 
[178]. 
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More in general, consider a perturbation of the image I(x) = I(x) + n(x), with n(x) assumed to be 
small in some norm. Then after quantization of the domain, using the Mean Value Theorem, we have 

I(xi)= I I(x)dx= I I(x)dx+ [ VI(x)dxS t (4.22) 

JB e (xi) JB e (x x ) JB e ( Xi ) 

where we have assumed that 5(x) is constant within each quantization region B e (xi). Now, if we approx- 
imate the piecewise constant function in each B e {xi) into a piecewise constant function in the partition 
{Sj } f=i ' f° r a gi yen N, we have 

N N N „ 

Xs ■ ( x i)&j = ^2 XSj (xi)aj + ^2 XSj / VI(x)dx6(xi) (4.23) 

3 = 1 3 3 = 1 3 = 1 

from which we can obtain, defining SSj = Sj — Sj the set-symmetric difference between corresponding 
regions, and measuring the area in each region, 

N N 

EE^W«rE VI(x)dx5j (4.24) 
j=i Xi eSj j=i Js j 

where we have now assumed that 5{xi) is constant within each Xi E Sj, and we have called that constant 
Sj . So, we have that for a partitioning to be BIBO Stable, we must have 

n N r 

El^l^E / HV/(aOII<te (4-25) 

3 = 1 3 = 1 S 

which is guaranteed so long as the image is smooth within each region Sj (but it can be discontinuous 
across the boundary dSj). Indeed, any reasonable segmentation procedure would attempt, for any given 
(fixed) N, to place the discontinuities of the (true underlying) image I at the boundaries dSj, therefore 
guaranteeing stability of the boundaries with respect to small perturbations of the image per the argument 
above. In particular, for N = 2, there are algorithms that guarantee a globally optimal solution [36] that is, 
by construction, the most stable with respect to small perturbations of the image. These results are 
summarized into the following statement. 

Theorem 6 (BIBO Stability of the segmentation tree) A quantization/ sup erpixelization 
operator, acting on a piecewise smooth underlying field and subject to additive noise, 
is BIBO stable at a fixed scale for a fixed tolerance S (or complexity level N) if and 
only if it places the boundary of the quantized regions/ sup erpixels at the discontinuities 
of the underlying field. 

However, our concern is not just stability with respect to the partition {Sj}^^ for a 

fixed TV, but also stability with respect to singular perturbations [ ] that change the SINGULAR PERTURBATIONS 

cardinality of the partition (a phenomenon linked to scale since N = N(a)). So, the 

superpixelization procedure should be designed to be stable with respect to TV, which 

is not a test we can write in terms of differential operations on the image. However, 

one can construct a greedy method that is designed to be stable with respect to both 

regular (bounded) and singular perturbations. 

1. Start at level k = with each pixel representing as a region, Si(0) = B e (xi) with 
i— 1 , . . . , iVo = # A, the number of pixels in the image. 

2. Construct a tree by creating a node representing the merging of the two regions that 
have the smallest average gradient along their shared boundary: for each k, obtain a 
new merged region Si(k + 1) = Si t (k) D Si ± (k) where 

= arg min Ei jTn (k) = 

l,m | S i (j)CS l (k)US rn (k) 



[ \\VI(x)\\dx/ [ ds (4.26) 

JdSi(k)ndS m (k) JdSi(k)nds m (k) 
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3. Continue until the gradient between regions being merged fails the test (4. 1), that is 

kj = argmin E hrn (k) > a. (4.27) 

l ,m 

4. At the end of the procedure, we have N = N(cr) = No — k regions. 

These regions are, by construction, BIBO stable for any given value of N. Now to obtain 
a partition that is also stable with respect to singular perturbations, we have to choose 
regions that are unaffected by changes in N. To this end: 

5. Consider each pixel 5^(0), and follow its path in the tree, {Si(k)} as it is merged with 
other regions, together with the cost of each merging. Note that the cost is zero except 
when a merging occurs; call the mergings {ki 1 , ki 2 , . . . }. If at a certain ki we have that 

Si(ki) = Si(ki) n Sm(ki), then the cost is Ei(ki) = E hrn (ki) given by (4.27). 

6. Let {ki(i)j • • • } be the instances when the region Si is merged with one of its 
neighbors, and Ei(kj(i)) the corresponding costs. Furthermore, let Skj(i) = kj+\(i) — 
kj (i) be the "iteration gaps" between two merges involving the region Si. 

7. For each pixel i and each set of indices {kij }f=i sort the gaps 6kj(i) = ki - — ki j _ 1 in 
decreasing order of their minimum gap 

mm(Skj(i),Sk j+1 (i)). (4.28) 

The level k with the largest minimum gap corresponds to the region Si(k) containing 
the pixel x% that is least affected by singular perturbations. Indeed, for all kj (i) < k < 
kj+i (i), any singular perturbation is a change in the value of N = No — k that does not 
change the region Si(k). 

If we follow this procedure for every pixel i, sorting their paths in decreasing order of 
gaps, until a minimum gap a is reached, then we have that each initial region Si (0) is 
now included in a number of (overlapping) regions Si(kj(i)) that are not only BIBO 
stable, but also stable with respect to singular perturbations. This follows by construc- 
tion and is summarized in the following statement. 

Theorem 7 (Stable segments) Let 5^(0), with i = l,...,N Q be each pixel in an im- 
age. Then let {kj(i)}ij be (j -multiple) indices where Si is merged with neighboring 
regions, 

kj(i) = arg min E\ m (k) < cr (4.29) 

l,m \Si(k-l)cSi(k-l)nSm(k-l) 

then the regions {Si(kj(i))}^ 1 are BIBO stable and stable with respect to singular 
perturbations. 

Clearly many of these regions will be overlapping, and many indeed identical, so some 
heuristics can be devised to reduce the number of regions that, as it stands, can be more 
than the number of pixels in the image. A principle to guide such agglomeration is the 
so-called Agglomerative Information Bottleneck [61, 183]. 

This procedure relates to MSER [132], where however regions are created from the 
watershed, and are selected based on the variation of their boundary relative to their 
area as the watershed progresses. It also relates to other attempts to define "stable 
segmentations", for instance [ ]. However, in these cases stability is characterized 
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empirically, and the algorithm is not guaranteed to be stable against a formal defini- 
tion. Also, we are not necessarily interested in a partition of the domain, so long as a 
sufficient number of regions are present to enable local canonization. 

So we have shown that canonizing translation via the centroid of superpixels is 
automatically viable, per (4.17) and (4.18), but only so long as the superpixelization 
is stable with respect to additive noise, for instance generated by quantization mecha- 
nisms. 

Theorem 8 (Canonization via superpixels) Let <j> a (I) = {Sj}^ be a partition of 
the domain into superpixels. Then the functional (4.16) is a (local, translation) co- 
variant detector so long as it is BIBO stable. 

That (4.16) is covariant follows from (4.17), provided that (4.1) is satisfied. 

By the same token, one could canonize rotation by considering the principal axis 
of the sample covariance approximation of the regions Sj . Although [ ] advocates 
canonizing general viewpoint changes by considering only the adjacency graph of the 
regions Sj, as we have discussed in Section 3, one should canonize no further. 

An alternative (and dual) procedure for canonization is to use not the centroid of the 
superpixels, but their junctions, which are the points of intersection of the boundaries of 
adjacent superpixels. This provides a " corner detection mechanism" that is consistent 
with the theory, and can be used as an alternative to Harris' corner detector discussed 
in the previous section. It should be noted that critical point filters [ 67] have been 
proposed for the detection of junctions at multiple scales. 

Detection and canonization are not important per se, and can be forgone if one is 
willing to marginalize nuisances at decision time. Simple (conservative) detection can 
be thought of as a way not to select canonizable regions, but to quickly reject obviously 
non-canonizable ones. Where detection becomes important is when no marginalization 
can be performed, for instance in two adjacent frames in video, where one knows that 
the underlying scene is the same. As it turn out, this impoverished recognition problem, 
often called tracking, plays an important role in recognition, to the point where we 
devote a chapter to it, the next. 

Before moving on, however, we would like to re-iterate the fact that the best mech- 
anisms to eliminate nuisances is via marginalization or max-out. When decision-time 
constraints dictate that this is not feasible, canonization provides a useful way to re- 
duce complexity, ideally keeping the expected risk in the overall decision problem 
unchanged (Section 2.4.1). 
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Chapter 5 

Correspondence 



In this chapter we explore the critical role that multiple images play in visual decisions. 
"Correspondence" refers to a particular case of visual decision problem, whereby two 
adjacent images are assumed to portray the same scene (owing to the physical con- 
straints of causality and inertia, the scene is not likely to change abruptly from one 
image to the next), and this "bit" is used to prime the construction of models of the 
scene for recognition. Whether correspondence can be established depends on a vari- 
ety of conditions that we describe next. 

In the next section we will introduce the notion of Proper Sampling, that has been 
anticipated in Chapter 3, and is illustrated in Figure 5.1: A signal can be canonizable, 
but not properly sampled (i.e. , canonizability arises from aliasing effects, such as the 
block-structure due to quantization). Or, it can be properly sampled, but not canoniz- 
able (e.g. a constant region of the image), it can be not canonizable, and not properly 
sampled (e.g. a spatially and temporally independent realization of Gaussian "noise"), 
and finally it can be properly sampled and canonizable, for instance a "blob" or the 
fine-scale structure in a random-dot stereogram (Figure 5.2), not to be confused with 
"noise." 

In Chapter 3 we have seen that an invariant descriptor can be constructed via can- 
onization, but canonizability is a necessary, not sufficient, condition for meaningful 
correspondence. In order to be meaningful, a structure on the image needs to corre- 
spond to a structure in the scene, as we discussed in Remark 6. Unfortunately, this 
condition cannot be tested on one image alone. It can either be determined at decision 
time, by marginalization or extremization, or it can be determined - under the Lam- 
bertian assumption - by looking at different images of the same scene, as we describe 
next. 

5.1 Proper Sampling 

Feature detection is a form of sampling, but one that is rather different from the clas- 
sical sampling theory of Nyquist and Shannon. In traditional signal processing, proper 
sampling refers to regular sampling at twice the Nyquist rate, since a band-limited 
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Figure 5.1: A region of an image can be canonizable, but not properly sampled (left), 
or properly sampled, but not canonizable (right). Only when a region of the image is 
both properly sampled and canonizable can we establish meaningful correspondence 
between structures in the image and structures in the scene. 

signal can be reconstructed exactly from the samples under these conditions. This as- 
BAND-LIMITED sumes that the signal is band-limited, which is as much an idealization as the Lambert- 

Ambient static model (strictly band-limited signals do not exist, as they would have to 
have infinite spatial support). 

But we are not interested in using the data to reconstruct a replica of the "origi- 
nal signal." Instead, we are interested in using the data to solve a decision or control 
task as if we had the "original signal" at hand. But what is the "original signal" in 
our context? It is the image before any quantization phenomenon occurs. For a single 
image, since we cannot ascertain occlusions (Section 8.1), quantization represents the 
main non-invertible nuisance (neglecting complex illumination effects). Similarly, as 
we have seen in Theorem 3, the translation- scale group g = {x, a} is the main in- 
vertible nuisance (since more complex groups are not invertible once composed with 
quantization). Therefore, following (2.8) in Section 2.2, we have that if the image is 
I = h(g^ : u) + n, then the "ideal image" is 

/i(e,e,0)=p(7r- 1 (D)). (5.1) 

Therefore, in the context of visual decisions, we can define proper sampling as a dis- 
cretization that yields a response of co- variant detector functionals having the same 
number, type and connectivity of critical points that the "ideal image" would produce. 

Definition 10 (Proper Sampling) A signal I is properly sampled at a scale cr if a co- 
variant detector operating on the image yields the same critical points as if it operated 
on the radiance of the scene. In other words, ifip(I, g) is a co-variant detector for the 
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Figure 5.2: Texture or structure? that is not the question! This image helps under- 
stand the importance of the notion of "proper sampling." Since it is possible to fuse 
these two images in the random-dot stereogram binocularly and get a dense depth map 
[96], this means that point-to-point correspondence is possible, which would not be 
possible if this were a texture at the pixel scale, according to the definition in Section 
3.5. However, the important question for correspondence is not whether something is a 
texture (stationary) or structure (canonizable), but whether it is properly sampled. This 
cannot be decided in a single image, but by testing the proper sampling hypothesis of 
Definition 10. In the specific case of the random-dot stereogram, this is a properly 
sampled pair of images, that just happen to have a very complex radiance. Recall that 
proper sampling depends not just on the scene, but also on the imaging condition, in- 
cluding the distance from the scene and the point-spread function of the eye, both of 
which affect the scale of the representation. 



location-scale group g = {x, a}, and h(e, £, 0) = p o tt l (D), then g) = 4=> 

Vijj(h(g, £, 0)) = 0, and corresponding extrema are of the same type. 

ATI 

In other words, an image is properly sampled if it is possible to reconstruct, from the 
samples, not the "ideal image" (the radiance of the scene), but one that is topologically 
equivalent to to it, in the sense of yielding the same extrema via a co-variant detector 
operator. The attributed Reeb tree (ART) is a topological construction that can be as- 
sembled from the image at any given scale. It is a tree with nodes at every isolated 
extremum (maxima, minima, saddles), with edges connecting extrema that are "adja- 
cent" in the sense of the Morse-Smale complex. The value of the image at the extrema 
is not relevant, but the ordering is, and the position of these extrema is not relevant, but 
the connectivity is. The ART was introduced in [ ] as a maximal contrast- viewpoint 
invariant away from occlusions, and is described in more detail in Section A.2. It can 
be shown that the outcome g = {xi : a} of any feature detector V^(I, g) = operating 
at a scale a can be written in terms of the ART: {xi]f =1 = ART (I *Q(x;a 2 )), which 
leads to the following claim. 

Theorem 9 A signal I is properly sampled at a if and only if ART(h{e, £, 0) a 2 )) = 
ART(I*g(x;a 2 )). 

Thus, any of a number of efficient techniques for critical point detection [168] can be 
used to compute the ART and test for proper sampling. Note that, in general, any 
feature detection mechanism alters the position of the extrema relative to the original 
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signal, as illustrated in Example 5. This is not an issue in the context of visual deci- 
sions, because extrema move as a result of viewpoint changes, and their motion would 
therefore be discarded as a nuisance anyway. 

The problem with the definition of proper sampling is that it requires knowledge 
of the u ideal image" ft(e, £, 0), which is in general not available (indeed, it may be 
argued that it does not exist). Unlike classical sampling theory, 1 there is no "critical 
frequency" beyond which one is guaranteed success in canonization, because of the 
scaling/quantization phenomenon. Therefore, to test for proper sampling on the image 
I t (x), rather than comparing it to the "ideal image"(which is a function of the unknown 
radiance distribution of the scene, we compare it to the next image I t+1 {x), which is 
equivalent to it under the Lambertian assumption in the co-visible region [174], as done 
in Section 2.2. Therefore, the notion of proper sampling (now in space and time) relates 
to the notion of correspondence, co-detection, or "trackability" as we explain below. 

Definition 11 (Proper Sampling) We say that a signal {I t }f =1 can be properly sam- 
pled at scale a t at time t if there exists a scale <r t+1 such that ART(I t * Q(x; erf)) = 
ABT(It+i*g(x;o* +1 )). 

In other words, a signal is properly sampled in space and time if the feature detection 
mechanism is topologically consistent in adjacent times. Alternatively, we can impose 
that a t = a t +i and therefore require topological consistency at the same scale. Some- 
times we refer to proper sampling of corresponding regions in two images as being 
co-canonizable . 

Note that in the complete absence of motion, proper sampling cannot be ascer- 
tained. However, complete absence of motion is only real when one has one image, as 
a continuous capture device will always have some changes, for instance due to noise, 
making two adjacent images different, and therefore the notion of topological consis- 
tency over time meaningful, since extrema due to noise will not be consistent. Note 
that the position of extrema will in general change due to both the feature detection 
mechanism, and also the inter-frame motion. Again, what matters in the context of 
visual decisions is the structural integrity (stability) of the detection process, i.e. its 
topology, rather than the actual position (geometry). If a catastrophic event happens 
between time t and t + 1, for instance the fact that an extremum at scale a splits or 
merges with other extrema, then tracking cannot be performed, and instead the entire 
ARTs have to be compared across all scales in a complete graph matching problem. 

While establishing correspondence under these circumstances can certainly be done 
(witness the fact that we can recognize the top-left image in Figure 5.1 as a quantized 
photograph of Abraham Lincoln), this has to be done by marginalization, extremization 
or canonization, treating correspondence as a full-fledged recognition problem (a.k.a. 

WIDE-BASELINE MATCHING wide-baseline matching) rather than exploiting knowledge that the underlying scene is 

the same. This is the crucial "bit" provided by temporal continuity. Motivated by this 

TRACKABILITY reasoning, we introduce the notion of "trackability." 

Definition 12 (Trackability) A region of the image I\ B is trackable at a given scale if 
it is canonizable and properly sampled at that scale. 

l ln reality, even in classical sampling theory there is no critical sampling, since no real-signal is strictly 
band-limited, so in strict terms Nyquist's frequency does not exist. 
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It may seem at first that any region of an image can be made trackable by considering 
it at a sufficiently coarse scale. Unfortunately, this is not the case. 

Remark 9 (Occlusions) We first note that occlusions do, in general, alter the topology 
of the feature detection mechanism, hence the ART. Therefore, they cannot be properly 
sampled at any scale. This is not surprising, and intuitively we know that we cannot 
track through an occlusion, as correspondence cannot be established for regions that 
are visible in one image (either I t or It+i) but not the other. In the context of feature 
detectors/descriptors, occlusion detection reduces to a combinatorial matching process 
at decision time. Pixel-level occlusion detection will be discussed in Section 8.1. 

Remark 10 Note that a signal is not properly sampled per-se. In general, only trivial 
signals are globally properly sampled. However, there is a scale at which a signal 
may be properly sampled, as we will see shortly. As an illustration, Figure 5.3 shows 
the same image that is not properly sampled at the pixel level (left) but it is properly 
sampled when seen at a significantly coarser scale ( small in-set image on the left, or 
blurred version on the right). 




Figure 5.3: Whether an image is properly sampled depends on scale. The image on the 
left is not properly sampled at the finest scale afforded by the reproduction medium of 
this manuscript. However, it is properly sampled at a considerably coarser scale (right). 

Note that the notion of proper sampling relates to texture, in the sense that even if 
two images independently exhibit stationary statistics (and therefore they are indepen- 
dently classified as textures), if they are co-canonizable they are properly sampled, and 
therefore point-correspondence can be established (Figure 5.2). Also, we recall that 
we consider constant-color regions to be (trivial) textures, even though they are often 
referred to as "textureless." Note that a constant-intensity region is, in general, properly 
sampled, but not canonizable. A stochastic texture is not canonizable (by definition) at 
the native scale (sensor resolution), where it is usually not properly sampled, because 
canonizability arises from sampling/aliasing phenomena. Note, again, that all these 
considerations depend on scale, as we have discussed in Section 3.5. Also note that we 
are not assuming that the signals we measure are properly sampled. Instead, we use 
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the definition to test the hypothesis of proper sampling at a given scale, and therefore 
determine co-visibility and ascertain trackability. 

Although, as we have pointed out before, two-dimensional scale-spaces do not en- 
joy a causality property, typically the number of extrema diminishes at coarser scales, 
and the extrema that persist are, usually, corresponding to some structure in the image. 
So, although one cannot prove that for any image there exists a scale at which it can 
be properly sampled, in co-visible regions, one typically finds this to be the case for 
most natural images. This is because at a large enough scale extrema will coalesce, and 
eventually quantization phenomena become negligible. 2 

Therefore, one can conjecture that, for most natural images, in the absence of oc- 
clusions, assuming continuity and a sufficiently slow motion relative to the temporal 
sampling frequency, any image region is trackable. This means that there exists a 
large-enough scale cF max such that ART(I t * Q{x\ (J 2 max ) = ART(I t+1 * Q{x\ (J 2 max ). 

Thus anti-aliasing in space can lead to proper sampling in time. This is important 
because, typically, temporal sampling is performed at a fixed rate, and we do not want 
to perform temporal anti-aliasing by artificially motion-blurring the images, as this 
would destroy spatial structures. Note, however, that once a large enough scale is 
found, so correspondence is established at the scale a max , the motion g t computed at 
that scale can be compensated for, and therefore the (back-warped) images I t o g~ x can 
now be properly sampled at a scale a < cF max . This procedure can be iterated, until 
a minimum o- m i n can be found beyond which no topological consistency is found. 
Note that a min may be smaller than the native resolution of the sensor, leading to a 

SUPER-RESOLUTION super-resolution phenomenon. 

This suggests a procedure for tracking, whereby one first selects structurally stable 
features via proper sampling. The structural stability margin determines the neighbor- 
hood in the next image where detection is to be performed. If the procedure yields 
precisely one detection in this neighborhood, topology is preserved, and proper spatio- 
temporal sampling is achieved, hence trackability. Otherwise, a topological change 
has occurred, and the track is broken. This procedure is performed first at the coarsest 
level, and then propagated at finer scales by compensating for the estimated motion, 
and then re-selecting at the finer scales [116]. Note that this procedure, described in 
more detail in the next section, is different from traditional multi-scale feature tracking, 
where each feature detected at the finest scale is tracked at all coarser scales. In this 
framework, feature detection is initiated at each scale, in the region back-warped from 
coarser scales. 



5.2 The role of tracking in recognition 

The goal of tracking is to provide correspondence of (similarity or isometric) reference 
frames g^ = {xi, aij,Rij}, centered at Xi, with size cr^- and orientation Rij. This is 
the outcome of feature detection based on structural stability through proper sampling. 



2 However, at coarser scales, the landscape of the response of a co- variant detector becomes flatter, and 
therefore the BIBO gain smaller, and feature detection less reliable. 
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Because of temporal continuity, 3 the class label c is constant under the assumption 
of co-visibility and false otherwise. In this sense, time acts as a "supervisor" or a 
"labeling device" that provides ground-truth training data. The local frames g^ are now 
effectively co-detected in adjacent images. Therefore, the notion of structural stability 
and "sufficient separation" of extrema depends not just on the spatial scale, but also 
on the temporal scale. For instance, if two 5-pixel blobs are separated by 10 pixels, 
they are not sufficiently separated for tracking under temporal sampling that yields a 
20-pixel inter-frame motion. 

Thus the ability to track depends on proper sampling in both space and time. This 
suggests the approach to multi-scale tracking used in [1 16]: 

1. Construct a spatial scale-space, until the signal is properly sampled in time. 

2. Estimate motion at the coarser scale, with whatever feature tracking/motion es- 
timation/optical flow algorithm one wishes to use [ ]. This is now possible 
because the proper sampling condition is satisfied both in space and time. 4 

3. Propagate the estimated motion in the region determined by the detector to the 
next scale. At the next scale, there may be only one selected region in the corre- 
sponding frame, or there may be more (or none), as there can be singular pertur- 
bations (bifurcations, births and deaths). 

4. For each region selected at the next scale, repeat the process from 2. 

This is illustrated in Figure 5.4. 

Note that only the terminal branches of the selection scale-space provide an esti- 
mate of the frame g, whereas the hidden branches are used only to initialize the lower 
branches. Alternatively, one can report each motion estimate at the native selection 
scale (Figure 5.4 middle). 

This approach is called tracking on the selection tree (TST) [116], because it is 
based on proper sampling conditions and tracking is performed at each native selection 
scale. Indeed, note that the "second-moment-matrix" test commonly used for tracking 
[124, 186], even if performed at each scale, cannot be relied upon to reject features 
that cannot be tracked. This is because it is possible, and indeed typical, that due to 
singular perturbations, the region passes the second-moment test (e.g. harris' [76]) at 
each scale, but not because the feature of interest is trackable, but because additional 
structures have appeared in the new scale (Figure 5.4 right). This is a form of structural 
aliasing phenomenon. 

Tracking provides a time-indexed frame g t that can be canonized in multiple im- 
ages to provide samples of the canonized statistic (j) A h(^is t ), where v t lumps all other 

3 This temporal continuity of the class label does not prevent the data from being discontinuous as a 
function of time, owing for instance to occlusion phenomena. However, in general one can infer a description 
of the scene £, and of the nuisances g, u from these continuous data, including occlusions [54, 90], as we 
show in Section 8.1. It this were not the case, that is if the scene and the nuisances cannot be inferred from 
the training data, then the dependency on nuisances cannot be learned. 

4 In practice, there is a trade-off, as in the limit too smooth a signal will fail the trans versality condition 
(3.7) and will not enable establishing a proper frame g. 
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Figure 5.4: Tracking on the selection tree. The approach we advocate only pro- 
vides motion estimates at the terminal branches (finest scale); the motion estimated at 
inner branches is used to back-warp the images so large motion would yield properly - 
sampled signals at finer scales (left). As an alternative, the motion estimated at inner 
branches can also be returned, together with their corresponding scale (middle). Tra- 
ditional multi-scale detection and tracking, on the other hand, first "flattens" all se- 
lections down to the finest level (dashed vertical downwards lines), then for all these 
points considers the entire multi-scale cone above (shown only for one point for clar- 
ity). As a result, multiple extrema at inconsistent locations in scale-space are involved 
in providing coarse-scale initialization (right). Motion estimates at a scale finer than 
the native selection scale (thinner green ellipse), rather than improving the estimates, 
degrade them because of the contributions from spurious extrema (blue ellipses). Mo- 
tion estimates are shown on the right (blue = [116], green = multi-scale Lucas-Kanade 
[124, 13]). 

nuisances that have not been canonized (including the group nuisances that do not com- 
mute with non-invertible nuisances). These can then be used to build a descriptor, for 
instance a template (2.30), or more general ones that we discuss next. 

Note that the process of selection by maximizing the structural stability margin can 
be understood as a scale canonization process, even though in Section 3.4 we argued 
against canonization of scale. Note, however, that here we are not making an asso- 
ciation of features across scales other than for the purpose of initializing tracks. In 
particular, consider the example of the corner in Figure 5.4: The corner only exists 
at the finest scale. Features detected at coarser scales serve to initialize tracking at 
the finest scale, but features selected at coarse scales are not associated to the corner 
point, they are simply different local features. Aggregating multiple TSTs can also be 
done, when some "side information" is available that enables to group them together. 
For instance, in [207], detachability is used as side information [9], exploiting, again, 
knowledge of occlusions [7]. 



Chapter 6 

Designing feature descriptors 



A feature detector provides a G-covariant reference frame, relative to which the image 
is, by construction, G-invariant. So, the simplest invariant descriptor is the image 
itself, expressed in the reference frame determined by the detector (3.8). For the case 
of Euclidean and contrast transformations, G = SE(2) x Ad, where M denotes the 
set of monotonic continuous transformations of the range of the image. In the case 
in which a sequence of images is available, correspondence of reference frames is 
provided by tracking {gt}J=i, as described in Section 5.1, and again the entire time 
series in the normalized frame, I o g~ x is by construction G-invariant. Of course, all 
the non-invertible nuisances, as well as the group nuisances that do not commute with 
them, are not eliminated by this procedure, and therefore they have to be dealt with 
at decision time via either marginalization (2.12), or extremization (2.13). Using the 
formal image-formation model h, and the maximal G-invariant </> A , we have that 

^{{h}J =1 ) = {h{gi\iU,u t )}J =1 . (6.1) 

However, marginalizing or max-outing the residual effects of the nuisance {v t }J^i - 
that include occlusions, quantization, noise, and all un-modeled photometric phenom- 
ena such as specularities, translucency, inter-reflections etc. - can be costly at decision 
time. Therefore, in Section 2.6.1, we have justified the introduction of an alternate ap- 
proach that attempts to simplify the decision run-time by aggregating the training set 
into a template. Of course, one could do the same on the test set, which may consist 
of a single image T = 1, or of video. The result is what is called a feature descriptor. 
In this chapter we will explore alternate descriptors, more general than the best tem- 
plate described in Section 2.6.1, which we show how to compute in Section 6.3. We 
also discuss relationships to sparse coding in Section 6.5, which links to the discussion 
of linear detectors in Section 4.2. Recall that the "best template" in Section 2.6.1 is 
only the best among (static) templates. Therefore in Section 6.4 we show that one can, 
in general, achieve better results by not eliminating the time variable t as part of the 
feature description process, but retaining the time variable, and instead marginalizing 
it or max-outing it as part of the decision process. This requires the introduction of 
techniques to marginalize time, which we describe in Section 7.2. 
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Before doing all that, however, we summarize the role of various nuisances and 
how they are handled before the descriptor design process commences. 

6.1 Nuisance taxonomy (summary) 

This section summarizes material from previous chapters, and can therefore be skipped. 

In Section 2.2 we have divided the nuisance into those that have the structure of a group, g, and those that 
are non-invertible, u, including the additive component n, which we refer to as "noise." Then, in Section 
3.4, we have shown that of the group nuisance, only a sub-group commutes with v and n. These are the 
group-commutative nuisances, consisting of the isometric group of the plane. Indeed, these are only locally 
canonizable and therefore can be eliminated at the outset via co-variant detection and invariant description 
(3.8), modulo a selection process corresponding to combinatorial matching at decision time, to marginalize 
the occlusion nuisance, as described in (3.20) in Section 3.4. In this section we elaborate on how to treat the 
various nuisances, thus expanding (3.20). 

Translation: Canonization corresponds to the selection of a particular point (feature detection) that is cho- 
sen as (translational) reference. The image in this new reference is, by construction, invariant to 
translation, assuming that the feature detection process is structurally stable, commutes with non- 
invertible nuisances, and there is no occlusion. There are many mechanisms to canonize translation. 
Examples include the extrema of the determinant of the Hessian of the image, or of the convolution 
with a Laplacian of Gaussian, extrema of the determinant of the second-moment matrix. In any case, 
one must choose multiple translations T{ at each sample scale Oj . As we have discussed in Section 
4, scale should be sampled, not canonized (it does not commute with quantization), and it can be 
either sampled regularly, or sampled in a way that yields maximum structural stability margins, as 
described in Section 4.1. If the image is encoded by a "segmentation tree" (a partition of the domain 
into regions that are constant to within a certain tolerance a), then the centroid of each region (a 
node in the region adjacency graph) or the junctions of boundaries of three or more regions (faces in 
the adjacency graph) are also viable canonization procedures that can be designed to be structurally 
stable. 

Scale: Scale is not canonizable in the presence of quantization. Whatever descriptor one chooses should be 
represented at multiple scale. This is described in Section 5.2. 

Note that a variety of sampling options is possible, depending on the interplay with translational 
feature detection. One could first sample all available scales, and then independently canonize trans- 
lation in each one. Or, one could canonize translation at the native scale of the image, and then 
sample the image at multiple scales at that location. Or, one can simultaneously sample scale and 
detect translational frames by performing maximally-structurally stable translation selection, where 
the maximization of the stability margin is performed relative to scale. This is accomplished in a 
manner similar, but not identical, to scale selection as prescribed by [1 19], described in [116]. 

Rotation: Planar rotation is canonizable, and therefore it should be canonized. In the presence of measure- 
ments of gravity, a natural canonical orientation is provided by the projection of the gravity vector 
onto the image plane (assuming it is not aligned with the optical axis). 

If we have N?Ns regions available from sampling translations and scale, we can canonize each one 
with a co-variant detector and its corresponding invariant descriptor Co-variant detection can be 
performed in a number of ways, using extrema of the gradient direction [123] (at each location, at 
each scale), or the principal direction of the second moment matrix [134] at each location and scale, 
or nodes and faces of the adjacency graph of e-constant regions [ ], at each location and scale. 
The result is an unchanged number N?Ns of descriptors that are now invariant to rotation as well 
as (locally) translation. Alternatively, rotation can be canonized by using gravity and longitude (or 
latitude) as canonical references. 

Contrast: Contrast can be canonized independently of other nuisances by replacing the intensity at every 
pixel with the gradient direction: 

V/ 

cf>(I) = (6-2) 

This is conceptually straightforward, except that it assumes that the image is differentiable, which in 
general is not the case (one can define discrete differentiability which, however, entails a choice of 
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scale, thus making this choice still dependent on scale). It also raises issues when || V/|| = 0, i.e. , 
when the image is constant in a region. As an alternative mechanism, one can canonize contrast, in 
a local region determined by a translational frame Ti at a scale crj , by normalizing the intensity by 
subtracting the mean and dividing by the standard deviation of the image, or of the logarithm of the 
image. If we call Bij = B crj (x — Ti) a neighborhood of size (jj around the canonical reference Ti, 
then 

J-/j(J) 
std(J) 



4>ij{I)= izitV (6-3) 



f B . . !( x ) dx IB- - \I{x)-Kl)\ 2 dx 



where = — p — ^ — and std(J) = \ — - — j ^ • In this latter case, the canon- 

JBij x V l o 

ization procedure interacts with scale, and therefore contrast normalization should be done indepen- 
dently at all scales. Additional options if multiple spectral bands are available is to consider spectral 
ratios, for instance in an (R, G, B) color space one can consider the ratio R/B and G/B, or the spectral r 
normalized color space, or in an (if, 5, V) color space one can consider the H and S channels, etc. 

Among the nuisances that cannot be canonized, in addition to scale and occlusion that should be sampled 
and marginalized as in Section 3.4, we have: 

Illumination: Complex illumination effects other than contrast cannot be canonized, and therefore their 
treatment has to be deferred to decision time. 

Quantization and noise: Quantization is intertwined with scale and cannot be canonized. At each scale, 
quantization error can be lumped as additive noise. Detectors for canonizable nuisances should be 
designed to commute with quantization and noise. Linear detectors do so by construction. 

Skew: One could treat the (non-canonizable) group of skew transformation in the same way as scale, but 
since there is no meaningful sampling of this space, we lump the skew with other deformations and 
defer their treatment to training or decision. 

Deformations and quantization: Domain deformations other than rotations and translations, including the 
affine and projective group or more general diffeomorphisms, cannot be canonized in the presence of 
quantization, and therefore their handling should be deferred to either training (blurring) or testing 
(marginalization, extremization) . 

Occlusion: Occlusions, finally, are not invertible and cannot be canonized, so they will have to be ex- 
plicitly detected during the matching process (via max-out or marginalization), which in this case 
corresponds to a selection of a subset of the N?Ns local descriptors as described in Section 3.4. 

What we have in the end, for each image /, is a set of multiple descriptors (or tem- 
plates), one per each canonical translation and, for each translation, multiple scales, 
canonized with respect to rotation and contrast, but still dependent on deformations, 
complex illumination and occlusions: 

(f)(1) = {kij o p(SjRijX + TiVij(x)) + riij(x), (6.4) 
i, j = 1, . . . N T , N s \B aj (x + Ti) H D = 0} 

where Vij is the residual of the diffeomorphism w(x) after the similarity transformation 
SRx+T has been applied, i.e. , (x) = w(x)— SjRijX — Ti. Here S = crl is a scalar 
multiple of the identity matrix that represents an overall scaling of the coordinates, 
and (RjT) G SE(2) represent a rigid motion with rotation matrix R G SO (2) and 
translation vector T G M 2 . If we call the frame determined by the detector = 
{Sj , Ti , Rij , k^}, we have that 



<Kl) = {IogJ}»T£ a . (6.5) 



Note that the selection of occluded regions, which is excluded from the descriptor, 
is not know a-priori and will have to be determined as part of the matching process. 
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As we have discussed in Section 4.1, the selection process should be designed to be 
invariant with respect to group-commutative nuisances, and "robust" (in the sense of 
structural stability) with respect to all other nuisances. The selection and tracking 
process described in Section 5.1 provides one such design. 
TIME SERIES In the case of video data, {It}t=i> one obtains a time series of descriptors, 

4>({It}?=i) = {It o &(t)}?T£*< T (6.6) 

where the frames g^ (t) are provided by the feature detection mechanism that, in the 
case of video, includes tracking via proper sampling (Section 5.1). Once that is done, 
one should store the time series of descriptors (j)({I t }f = i) for later marginalization, if 
sufficient storage and computational power is available, as we describe in Section 6.4. 
Otherwise, a static descriptor can be devised, as we discuss in Section 6.3. Before 
doing so, however, we discuss the interplay between the scene and the nuisance in the 
next section. 



6.2 Disentangling nuisances from the scene 

In the image-formation model described in Section 2.2, and in the more general model 
in Section B.l, some of the nuisances interact with the scene to form an image. If the 
nuisances are marginalized or max-outed, as in (2.12) or (2.13), this is not a problem. 
However, if we want to canonize a nuisance, in the process of making the feature 
invariant to the nuisance, we may end up making it also invariant to some components 
of the scene. In other words, by abusing canonization we may end up throwing away 
the baby (scene) with the bath water (nuisances). 
VIEWPOINT-SHAPE INTERACTION The simplest example is the interaction of viewpoint and shape. In the model (2.6), 

we see immediately that the viewpoint g and shape S interact in the motion field (2.9) 
via w(x) = 7rg7r~ 1 (x), where p = tt~ 1 (x) G S depends on the shape of the scene. It 
is shown in [177] that the group closure of domain warpings w span the entire group 
of diffeomorphisms, which can therefore be canonized - if we exclude the effects of 
occlusion and quantization. However, necessarily the canonization process eliminates 
the effects of the shape S in the resulting descriptor, which is the ART described in 
Section 5.1. This had already been pointed out in [197]. This means that if we want 
to perform recognition using a strict viewpoint-invariant, then we will lump all objects 
that have the same radiance, modulo a diffeomorphism, into the same class. That means 
that, for instance, all white objects are indistinguishable using a viewpoint invariant 
statistic, no matter how such an invariant is constructed. Of course, as pointed out in 
[197], this does not mean that we cannot recognize different objects that have the same 
radiance. It just means that we cannot do it with a viewpoint invariant, and instead we 
have to resort to marginalization or extremization. 

The same phenomenon occurs with reflectance (a property of the scene) and il- 

REFLECTANCE-ILLUMINATION INTERACTION lumination (a nuisance). This is best seen from a more general model than (2.6), for 

instance one of the models discussed in Section B. 1 . In the case of moving and possibly 
deforming objects, there is also an ambiguity between the deformation (a characteristic 
of the scene) and the viewpoint (a nuisance). 
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Deciding how to manage the scene-nuisance interaction is ultimately a modeling 
choice, that should be guided by two factors. The first is the priority in terms of speed 
of execution (biasing towards canonizing nuisances) vis-a-vis discriminative power (bi- 
asing towards marginalization to avoid having multiple scenes collapse into the same 
invariant descriptor). The second is a thorough understanding of the interaction of the 
various factors and the ambiguities in the image formation model. This means that 
one should understand, given a set of images, what is the set of all possible scenes that, 
under different sets of nuisances, can have generated those images. This is the set of in- 
distinguishable scenes, that therefore cannot be discriminated from their images. This 
issue is very complex and delicate, and a few small steps towards a complete analysis 
are described in Section B.3. 

In this chapter, we will set this issue aside, and agree to canonize (locally) transla- 
tion, rotation and contrast, and sample scale. This means that scenes that are equivalent 
up to a similarity transformation are indistinguishable, as shown in [125], which is not 
a major problem, but leaving the reflectance-illumination ambiguity unresolved. In the 
next section, we move on to describe some of the descriptors that can be constructed 
under these premises. 



6.3 Template descriptors for static/rigid scenes 

If we are given a sequence of images {I t } of a static scene, or a rigid object, then the 
only temporal variability is due to viewpoint g t , which is a nuisance for the purpose 
of recognition, and therefore should be either marginalized/max-outed or canonized. 
In other words, there is no "information" in the time series {gt}, and once we have 
the tracking sequence available, the temporal ordering is irrelevant. This is not the 
case when we have a deforming object, say a human, where the time series contains 
information about the particular action or activity, and therefore temporal ordering is 
relevant. We will address the latter case in Section 7.2. For now, we focus on rigid 
scenes, where S does not change over time, or rigid objects, which are just a simply 
connected component of the scene Si (detached objects). The only role of time is to 
enable correspondence, as we have seen in Section 5.1. 

The simplest descriptor that aggregates the temporal data is the best template de- 
scriptor introduced in Section 2.6. 1 . However, after we canonize the invertible-commutative 
nuisances, via the detected frames g t , we do not need to blur them, and instead we can 
construct the template (2.30) where averaging is only performed with respect to the 
nuisances v, rather than (2.25), where all nuisances are averaged-out. The prior dP(v) 
is generally not known, and neither is the class-conditional density dQ c (£). However, 
if a sequence of frames 1 {gk}k=i nas t> een established in multiple images {Ik}k=i> 



1 We use k as the index, instead of t, to emphasize the fact that the temporal order is not important in this 
context. 
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with 1^ = h(gkj £fc, z'fc), then it is easy to compute the best (local) template via 2 
0(/ c ) = ^ 0(I)dP(I\c) = M(Mfc^fc) = X) /o ^ fc = S 

(6.7) 

where (j)ij(Ik) are the component of the descriptor defined in eq. (6.4) for the k-th 
image Ik. A sequence of canonical frames {gk}J=i is the outcome of a tracking pro- 
cedure (Section 5.1). Note that we are tracking reference frames g^, not just their 
translational component (points) Xi, and therefore tracking has to be performed on 
the selection tree (Figure 5.4). The template above I c , therefore, is an averaging of 
the gradient direction, in a region determined by c/k, according to the nuisance dis- 
tribution dP{y) and the class -conditional distribution dQ c {£), as represented in the 
training data. This "best-template descriptor" (BTD) is implemented in [ ]. It is 
related to [46, 19, 185] in that it uses gradient orientations, but instead of performing 
spatial averaging by coarse binning, it uses the actual (data-driven) measures and av- 
erage gradient directions weighted by their standard deviation over time. The major 
difference is that composing the template requires local correspondence, or tracking, 
of regions fa, in the training set. Of course, it is assumed that a sufficiently exciting 
sample is provided, lest the sample average on the right-hand side of (6.7) does not 
approximate the expectation on the left-hand side. Sufficient excitation is the goal of 
active exploration as described in Chapter 8. 

Note that, once the template descriptor is learned, with the entire scale semi-group 
spanned in dP(u) 3 recognition can be performed by computing the descriptors faj at a 
single scale (that of the native resolution of the pixel). This significantly improves the 
computational speed of the method, which in turn enables real-time implementation 
even on a hand-held device [ ]. It should also be noted that, once a template is 
learned from multiple images, recognition can be performed on a single test image. 

It should be re-emphasized that the best-template descriptor is only the best among 
templates, and only relative to a chosen family of classifiers (e.g. nearest neighbors 
with respect to the Euclidean norm). For non-planar scenes, the descriptor can be 
made viewpoint-invariant by averaging, but that comes at the cost of losing shape dis- 
crimination. If we want to recognize by shape, we can marginalize it, but that comes at 
a (computational) cost. 

It should also be emphasized that the template above is a first-order statistic (mean) 
from the sample distribution of canonized frames. Different statistics, for instance 
the median, can also be employed [116], as well as multi-modal descriptions of the 
distribution [207] or other dimensionality reduction schemes to reduce the complexity 
of the samples. 



2 As pointed out in Section 2.6.1, this notation assumes that the descriptor functional acts linearly on the 
set of images X; although it is possible to compute it when it is non-linear, we make this choice to simplify 
the notation. 

3 Either because of a sufficiently rich training set, or by extending the data to a Gaussian pyramid in 
post-processing. 
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6.4 Time HOG and Time SIFT 

The best-template descriptor in the previous section is a first-order statistic of the con- 
ditional distribution p(I\c). As an alternative, instead of averaging the sample, one 
can compute a histogram, thus retaining all statistics. For this to be done properly, one 
has to consider each image to be a sample from the class-conditional distribution. A 
wildly simplifying assumption is to assume that every pixel is independent (obviously 
not a realistic assumption) so that the conditional distributions can be aggregated at 
each pixel. This is akin to computing, for every location x in a canonized frame gk, 
the temporal histogram of gradient orientations. This statistic eliminates time and dis- 
cards temporal ordering, but retains the distributional nature of the data as opposed to 
aggregating it into a first-order statistic. 

Example 7 (SIFT and HOG revisited) If instead of a sequence {Ik} one had only 
one image available, one could generate a pseudo -training set by duplicating and 
translating the original image in small integer intervals. The procedure of building 
a temporal histogram described above then would be equivalent to computing a spatial 
histogram of gradient orientations. Depending on how this histogram is binned, and 
how the gradient direction is weighted, this procedure is essentially what SIFT [123] 
and HOG [ ] do. So, one can think of SIFT and HOG as a special case of template de- 
scriptor where the nuisance distribution dP(v) is not the real one, but a simulated one, 
for instance a uniform scale-dependent quantized distribution in the space of planar 
translations. 

We call the distributional aggregation, rather than the averaging, of {<t>ij(Ik)} m (6.7) 
the Time HOG or Time SIFT, depending on how the samples are aggregated and binned 
into a histogram. It has been proposed and tested in [156]. 

As an alternative to temporal aggregation via a histogram, one could perform ag- 
gregation by dimensionality reduction, for instance by using principal component anal- 
ysis, or a kernel version of it as done in [133], or using sparse coding as we describe in 
the next section. 

Although a step up from template descriptors, Time SIFT and Time HOG still 
discard the temporal ordering in favor of a static descriptor. In cases where the temporal 
ordering is important, as in the recognition of temporal events, one should instead 
retain the time series {(j>(I t )} and compare them as we describe in Section 7.2, which 
corresponds to marginalizing, or max-outing, time. This process is considerably more 
onerous, computationally, at decision time. Before doing so, we illustrate an alternate 
approach to build local descriptors based on ideas from sparse coding [144]. 

6.5 Sparse coding and linear nuisances 

Many nuisances, whether groups or not, act linearly on the (template) representa- 
tion. Take for instance the instantiation of the Lambert-Ambient model (2.6), and 
assume that we have canonized contrast m (or, equivalently, after canonization, as- 
sume m = Id). The diffeomorphic warping w includes canonizable nuisances, group 
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nuisances that are not canonizable, and nuisances that are not invertible, such as occlu- 
sions, quantization, and additive noise. However complex as it may be, w is a linear 
functional^ acting on the radiance function p\ 

W : X X; p^Wp = po w' 1 . (6.8) 

The kernel that represents it is degenerate, in the sense that it is not an ordinary function 
but a distribution (Dirac's Delta) 



Wp = p(w 1 x) = / S(w 1 x — y)p(y)dy x G D\Q 
Jr 2 



(6.9) 



where Q denotes the occluded region. The kernel an ordinary function once we intro- 
duce sampling into the model: 



I(xi) = / Wp(x)dx + n(xi) = (6.10) 

I S(w~ 1 x — y)p(y)dxdy + n(xj) (6.11) 
B a (xi) Jr 2 

= / G(w~ 1 x i -y;a)p(y)dy^n(x i ) (6.12) 
Jr 2 

where the kernel Q{-\o) is the convolution of S with the characteristic function of the 
sampling domain B a . This shows that the compound effect of quantization and domain 
deformations is the convolution with a deformed sampling kernel. The kernel may also 
include an anti-aliasing filter. 5 

An alternative is to approximate p with piecewise linear or even piecewise constant 
functions, which can be done arbitrarily well under mild assumptions as described in 
Section 4.3, and encode the domain partition where the function is e-constant: 

{Si}? = \ I Si = D\Sl, Si H Sj = dij. (6.13) 
The function p can then be written as 

N S 

p{x) = ^2 PiXSi 0) • (6. 14) 

i=l 



4 A linear functional is a map W from a function space to a vector space that satisfies the linearity 
assumption, that is W(afi + fif<2) = cnW(fi) + PW(f2) for any functions fi, fa an scalars a, (3. In 
general, linear maps can be written as the integral of the argument against a kernel, [160]. 

5 One may notice that the integral is performed over the entire plane R 2 , which may raise the concern that 
the procedure has obvious memory and computational limitations. The fact that the domain is not bounded 
can be addressed by noticing that the value of the radiance p outside a ball of radius 3cr around the point 
w~ 1 Xi is essentially irrelevant since Q is typically (exponentially) low-pass. One could also discretize p, 
but that is also fraught with difficulties: How many samples should we store? Would p obey the conditions 
of Nyquist-Shannon's sampling theorem? In general we cannot expect the radiance p to be effectively 
band-limited: Real-world reflectance has sharp discontinuities due to material transitions, cast shadows and 
occlusions. One could coerce the problem into the narrow confines of the sampling theorem, but then low- 
pass filtering would be needed to fit the existing memory and computational constraints, leading us back to 
the place we started. 
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While the group w(x) is, in general, not piecewise constant, its variation within each Si 
is not observable, and therefore we can without loss of generality assume that w(x) = 
WiV x G S{. Naturally, there remains to be determined which indices i correspond to 
regions that are visible in both the template and the target image, and which ones are 
partially or completely occluded. 

A third alternative, motivated by the statistics of natural images [144], is to assume 
that p(y), while not smooth, is sparse with respect to some basis. This means that there 
are functions bi(y), . . . , 6jv(2/) and coefficients . . . , ajy such that, for all y G 1R 2 , 
we have 



TV 



(6.15) 



for some finite N, where we have used the vector notation b = [bi , . . . , 6 at] and a 
[ai, . . . , ajv] T . Plugging the equation above into (6.12), we have 



= / G(w 1 x i -y;cr)b(y)dya + n(x i ). 

JR 2 



(6.16) 



If we consider the joint encoding of every pixel xi in a region centered at the canonized 
locations Tj of size corresponding to the sample scales Sjk, we have 



- I( X1 ) - 


= [ 


. I(xn) 


J R 2 



Q(T, x x x - y;(Tjk) 



_ Q(T, 1 x N - y;a jk ) 



b(y)dya = Wba jk (6.17) 



from which one can see that the coefficients representing the "ideal image" p with 
respect to an over-complete basis b are the same as the coefficients representing the 
"actual" image /, relative to a transformed basis B = Wb, whose elements are Bi m = 
J Q(T~ x x\ — y;(ijk)b rn (y)dy. Note that the dependency on j (the location) and k 
(scale) is reflected in the coefficients a^. To obtain a description, one can estimate, 
for each j, k, the coefficients 



& jk = argmin \\I jk - Wba\ 



(6.18) 



from which the representation is given by 

£jk = boijk- 



(6.19) 



This requires knowing the basis 6, which in turn requires solving the above problem 
for all corresponding local regions of all images in the training set. This can be done 
by joint alignment as in [196, 199]. 

Obviously, the linear model does not hold in the presence of occlusions. Therefore, 
the representation ^ is only local where co- visibility conditions are satisfied. Occlu- 
sions, as usual, have to be marginalized or max-outed at decision time as described in 
Section 3.4. 
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6.5.1 Dictionary Hypercolumns 

As we have argued, it is possible to estimate an invariant representation from the image 
by assuming that the radiance function is sparse. This led to the observation that the 
representation of the "true" (but unknown) radiance p with respect to an over-complete 
basis b is the same of the representation of the known image I with respect to a trans- 
formed basis Wb. Note that, in general, this representation will not be unique, and 
there are some issues with the coherence of the basis that are beyond the scope of this 
manuscript (see instead [ ]). However, the projection (hallucination) of the represen- 
tation onto the space of images can be used to compute the residual, so any ambiguity 
in a is annihilated by the corresponding choice of basis vectors Wb. 

A consequence of this fact is that one can just augment the dictionary with trans- 
formed versions of the bases, thus obtaining an enlarged dictionary, and then use ex- 
actly the same algorithm for encoding p and /. The only difference is that one would 
have to organize the dictionary by "linking" bases that are transformed versions of each 
other, or "deformation hypercolumns" 

Ba = [Wh, . . . ,Wb n ]a (6.20) 

somewhat akin to "orientation hypercolumns" in visual cortex. In practice, one would 
sample the group w according to its prior (which is a combination of dP{v) and 
dP(g)), yielding a set of samples {Wj}, from which the enlarged basis can be con- 
structed, with elements Wjbi, and the understanding that any non-zero element of a 
multiplying a set of bases Wj b{ will have a non-zero coefficient . 

Unfortunately, the coefficients in this representation are not identifiable: Indeed, 
not only are they not unique, but they are not continuously dependent on the data, so it 
is possible for infinitesimally close images to have wildly different set of coefficients. 
This is not an issue if the goal is data transmission and storage, where the only role 
of the representation is to generate a faithful copy of the original (un-encoded) signal. 
However, in order to use a representation for decision or control, this program does not 
carry through. 



Chapter 7 

Marginalizing time 




Figure 7.1: With his pioneering light-dot displays, Johansson [ ] showed that a great 
deal of "information" is encoded in the temporal dynamics of visual signals. For in- 
stance, by the motion of a collection of points positioned at the joints, one can easily 
infer the type of action, the age group of the actor, his or her mood, etc. despite the fact 
that all pictorial cues have been eliminated from the image, and therefore such quality 
of the action cannot be inferred from a static signal (one frame of the sequence). 

As we have discussed in the previous chapter, templates average the data with re- 
spect to the distribution of the nuisances, regardless of temporal ordering. Temporal 
continuity provides a mechanism for tracking, as discussed in Section 5.1, but is oth- 
erwise irrelevant for the purpose of describing static scenes or rigid objects. Temporal 
averaging in the template is, in general, suboptimal, so one could consider statistics 
other than the average, for instance the entire temporal distribution, as we have done 
in Section 6.4. While that approach improves the descriptor in the sense of retaining 
the distributional properties, it still does not respect temporal ordering. These are just 
two approaches to canonizing time by either averaging or binning. Other approaches 
include a variety of "spatio-temporal interest point detectors" [205, 113] or other ag- 
gregated spatio-temporal statistics [47]. 
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ACTIONS, EVENTS If we are interested in visual decisions involving classes of "objects that have 

a temporal component, then all approaches that "canonize" time necessarily mod-out 
also the temporal variation of interest, in the same way in which canonizing viewpoint 
eliminates the dependency of the invariant descriptor on the shape of the scene (Section 
6.2). At the opposite end of the spectrum one could retain the entire time series (the 
temporal evolution of feature descriptors) (6.1), and compare them as functions of 
time using any functional norm in the calculation of the classifier (Section 2.1). This 
would not yield a viable result because the same action, performed slower or faster, or 
observed from a different initial time, would result in completely different time series 
(6.1), if not properly managed. Proper management here means that time should not 
just be canonized (as in a template or Time HOG) or ignored (by comparing time series 
as functions), but instead should be marginalized or max-outed. The only difference 
between these approaches is the prior on time, which is typically either uniform or 
exponentially decaying, so we will focus on a uniform prior, corresponding to a max- 
out process for time. 

The simplest mechanism to max-out time is known as dynamic time warping (DTW). 
dynamic time warping Dynamic time warping consists in a reparametrization of the temporal axis, via a func- 

DTW tion r <— h(t), that is continuous and monotonic (in other words, a contrast transfor- 

mation as we have defined it in Section 2.2). Accordingly, the dynamic time warping 
distance is the distance obtained by max-outing the time warping between two time 
series. The problem with DTW is that it preserves temporal ordering but little else. 
The moment one alters the time variable, all the velocities (and therefore accelerations, 
and therefore forces that generated the motion) are changed, and therefore the outcome 
of DTW does not preserve any of the dynamic characteristics of the original process 
that generated the data. In other words, if the time series was generated by a dynamical 
model, DTW does not respect the constraints imposed by the dynamics. This means 
that a large number of different "actions" are lumped together under DTW. The pi- 
oneering work of the psychologist Johansson, however, showed that a great deal of 
information can be encoded in the temporal signal (Figure 7.1). This information is 
destroyed by DTW. 

The next two sections describe DTW and point the way to generalizing it to take 
into account dynamic constraints. The reader who is uninterested in the subtleties 
of recognizing events or actions that have similar appearance but different temporal 
signatures can skip the rest of this chapter. 



7.1 Dynamic time warping revisited 

If we consider two time series of images, I\ and I2, where Ij = {Ij(t)}f =1 , for sim- 
plicity assumed to have the same length, 2 then the simplest distance we could define is 
the L 2 norm of the difference, do(h, 1 2) = Jq \\h (t) — h(t) \\ 2 dt, which corresponds 
to a generative model where both sequences come from an (unknown) underlying pro- 



x We call such objects "actions" or "events." 

2 The case of different lengths can be also considered at the cost of a more complex optimization. 
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cess 3 {h(t)}, corrupted by two different realizations of additive white zero-mean Gaus- 
sian "noise" (here the word noise lumps all unmodeled phenomena, not necessarily 
associated to sensor errors) 

Ij(t) = h(t)+n 3 (t) j = 1,2; t G [0, T] (7.1) 

The L 2 distance is then the (maximum-likelihood) solution for h that minimizes 

d (hj 2 ) = mm</> data (I 1 , I 2 \h) = V / \\n 3 (t)\\ 2 dt (7.2) 

subject to (7.1). Here h can be interpreted as the average of the two time series, and 
although in principle h lives in an infinite-dimensional space, no regularization is nec- 
essary at this stage, because the above has a trivial closed- form solution. However, later 
we will need to introduce regularizers, for instance of the form (j) reg (h) = J Q T 1 1 V/i 1 1 dt . 
This admittedly unusual way of writing the L 2 distance makes the extension to more 
general models simpler, as we discuss in the next sections. 

Consider now an arbitrary contrast transformation 4 m of the interval [0, T], called 
a time warping, so that (7.1) becomes 

I j (t) = h(m j (t))+n j (t) J = 1,2. (7.3) 

The data term of the cost functional we wish to optimize is still Y^i=i Jo \\ n j (f) \\ 2 dt, 
but now subject to (7.3), so that minimization is with respect to the unknown func- 
tions mi and rri2 as well as h. Since the model is over-determined, we must impose 
regularization [105] to compute the time-warping distance 

di(h,h)= min (j)data{h, l2\h,m 1: m 2 ) + (j) r eg{h). (7.4) 

Here H is a suitable space where h is assumed to live and M is the space of monotonic 
continuous (contrast) transformations. In order for r = m(t) to be a viable temporal 
index, m must satisfy a number of properties. The first is continuity (time, alas, does 
not jump); in fact, it is common to assume a certain degree of smoothness, and for the 
sake of simplicity we will assume that mi is infinitely differentiable. The second is 
causality: The ordering of time instants has to be preserved by the time warping, which 
can be formalized by imposing that rrii be monotonic. We can re- write the distance 
above as 

h ™ n „E / \\Ij(t) - HrnAtM 2 + M\Vh(t)\\dt (7.5) 

3 = 1 

where A is a tuning parameter that can be set equal to zero, for instance by choosing 
h(t) = ii(m^ 1 (t)), and the assumptions on the warpings mi are implicit in the defi- 
nition of the set M. This is an optimal control problem, that is solved globally using 
dynamic programming in a procedure called "dynamic time warping" (DTW). 

3 The notation we use in this chapter abuses the symbols defined previously, but is chosen on purpose so 
that the final model will resemble those of previous chapters. 

4 Note that this is not the contrast transformation operating on the image values, introduced in Section 
2.2, but it is a contrast transformation applied to the temporal domain. 
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It is important to note that there is nothing "dynamic " about dynamic time warping, 
other than its name. There is no requirement that the warping function x be subject to 
dynamic constraints, such as those arising from forces, inertia etc. However, some 
notion of dynamics can be coerced into the formulation by characterizing the set M. in 
terms of the solution of a differential equation. Following [155], as shown by [129], 
one can represent allowable x G M in terms of a small, but otherwise unconstrained, 
scalar function u: M = {m G H 2 ([0, T]) \x = ux\ u G L 2 ([0,T])} where H 2 
denotes a Sobolev space. If we define 5 si = rhi then s = us; we can then stack the 
two into 6 g = [m, s] T , indicative of a group (invertible nuisance) and C = [1, 0], and 
write the data generation model as 

(g j (t) = f(g j (t))+l(g j (t))u i (t) 
\l j (t) = h(Cg j (t))+n i (t) 

as done by [129], where ui G £ 2 ([0, T]). Here /, / and C are given, and h, raj(0), 
are nuisance parameters that are eliminated by minimization of the same old data term 
Y^j=i lo \\ n j (t)\\ 2 dt, now subject to (7.6), with the addition of a regularizer \(j) reg (h) 

and an energy cost for m, for instance 4> energy^) = J Q T ||i^|| 2 <it. Writing explicitly 
all the terms, the problem of dynamic time warping can be written as 

d 3 (I 1 ,I 2 )= min V/ ||/,(t)-/i(C^-(t))||+A||V/i(t)||+/i||^(t)||^ (7.7) 

h,Uj,mj ^ J 

subject to gj = f(gi) + l(gj)ui. Note, however, that this differential equation is only 
an expedient to (softly) enforce causality by imposing a small "time curvature" Ui. 



7.2 Time warping under dynamic constraints 

The strategy to enforce dynamic constraints in dynamic time warping is illustrated in 
Figure 7.2: Rather than the data being warped versions of some common function, as 
in (7.3), we assume that the data are outputs of dynamical models driven by inputs 
that are warped versions of some common function. In other words, given two time 
series {7^}, i = 1, 2, we will assume that there exist suitable matrices A, B, C, state 
functions ra$ of suitable dimensions, with their initial conditions, and a common input 
u such that the data are generated by the following model, for some warping functions 
Wi G M: 

m 3 (t) = Am 3 (t) + Bu(wj(t)) 
I j (t) = Cm j (t)+n j (t). 

Our goal is to find the distance between the time series by minimizing with respect 
to the nuisance parameters the usual data discrepancy Y^j=i lo \\ n j(i)\\ 2 dt subject to 

5 Si is not to be confused with the scale parameter. 

6 g in this section is not to be confused with viewpoint; the notation g is used because, consistent with the 
notation adopted earlier, it is an invertible nuisance. 
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DTW ! 
TWDC 



Figure 7.2: Traditional dynamic time warping (DTW) assumes that the data come from 
a common function that is warped in different ways to yield different time series. In 
time warping under dynamic constraints (TWDC), the assumption is that the data are 
the output of a dynamic model, whose inputs are warped versions of a common input 
function. 



(7.8), together with regularizing terms (j) reg (u) and with Wj G M. Notice that this 
model is considerably different from one discussed in the previous section, as the state 
g earlier was used to model the temporal warping, whereas now it is used to model the 
data, and the warping occurs at the level of the input. It is also easy to see that the 
model (7.8), despite being linear in the state, includes (7.6) as a special case, because 
we can still model the warping functions Wi using the differential equation in (7.6). In 
order to write this time warping under dynamic constraint problem more explicitly, we 
will use the following notation: 

I(t) = Ce At I(0) + / Ce A ^- r) Bu(w(r))dr = L (x(0)) + L t (u(w)) (7.9) 
Jo 

in particular, notice that L t is a convolution operator, L t (u) = F * u where F is the 
transfer function. We first address the problem where A,B,C (and therefore L t ) are 
given. For simplicity we will neglect the initial condition, although it is easy to take 
it into account if so desired. In this case, we define the distance between the two time 
series 

d 4 (/i,/ 2 ) =minV f \\Ij(t) -L t ( Uj (t))\\ + X\\ Uj (t) - u (wj(t))\\dt (7.10) 

subject to uo G H and Wj G M. Note that we have introduced an auxiliary variable 
which implies a possible discrepancy between the actual input and the warped version 
of the common template. This problem can be solved in two steps: A deconvolution, 
where Ui are chosen to minimize the first term, and a standard dynamic time warping, 
where Wi and uq are chosen to minimize the second term. Naturally the two can be 
solved simultaneously. 

When the model parameters A,B,C are common to the two models, but otherwise 
unknown, minimization of the first term corresponds to blind system identification, 
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which in general is ill-posed barring some assumption on the class of inputs U{ . These 
can be imposed in the form of generic regularizers, as common in the literature of blind 
deconvolution [64], or by restricting the classes of inputs to a suitable class, for instance 
sparse "spikes" [157]. This is a general and broad problem, beyond our scope here, so 
we will forgo it in favor of an approach where the input is treated as the output of an 
auxiliary dynamical model, also known as exo-system [84], This combines standard 
DTW, where the monotonicity constraint is expressed in terms of a double integrator, 
with TWDC, where the actual stationary component of the temporal dynamics is esti- 
mated as part of the inference. The generic warping w, the output of the exo-system 
satisfies 

iw j {t)=g j {t), j = 1,2 (7U) 
\gj(t) =Vj(t)gj{t) 

and 7 Wj (0) =0, Wj (T) = T. This is a multiplicative double integrator; one could con- 
ceivably add layers of random walks, by representing Vi are Brownian motion. Com- 
bining this with the time-invariant component of the realization yields the generative 
model for the time series U : 

= 9j(t), i = M 

gj(t) = Vi(t)gj(t) ^ 
rhj(t) = Arrijit) + Bu(wj(t)) 
J j (t) = Cm j (t)+n j (t). 

Note that the actual input function u, as well as the model parameters A,B,C, are 
common to the two time series. A slightly relaxed model, following the previous sub- 
section, consists of defining Ui(t) = u(wi(t)), and allowing some slack between the 
two; correspondingly, to compute the distance one would have to minimize the data 
term 

T 



te (/i,/2h,^i,A,B,C) = y] / \\nj(t)\\ 2 dt (7.13) 

3 = 1 J ° 



subject to (7.12), in addition to the regularizers 

2 r T 



(v i ,u) = Y j [ ||^)|| 2 + ||V^)|| 2 ^ (7.14) 
i=i Jo 



which yields a combined optimization problem 



2 j 1 

d 5 (h,I 2 )= min E / (ll^(*)-Cm j (t)|| 2 + ||,; j (t)|| 2 + ||V«(t)|| 2 )dt 

(7.15) 

subject to (7.12). This distance can be either computed in a globally optimal fashion 
on a discretized time domain using dynamic programming, or we can run a gradient 



1 Wj in this section is not to be confused with a diffeomorphism of the image domain. It is, however, a 
diffeomorphism of the time domain, hence the choice of notation. 
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descent algorithm based on the first-order optimality conditions. The bottom line is that 
one could use any of the distances introduced in this section, do, . . . , d 5 , to compare 
time series, corresponding to different forms of marginalization or max-out. 

This concludes the treatment of descriptors. The next chapters focus on how to 
construct a representation from data, and how to aggregate different objects under the 
same category as part of the learning process. 
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Chapter 8 

Visual exploration 



In this chapter we study the inverse problem of hallucination, introduced in Section 
3.1, that is the problem of exploration. Whereas hallucination produces images given a 
representation; exploration produces a representation given images. The goal of explo- 
ration is to actively control the data acquisition process in such a way that aggregating 
Actionable Information eventually yields a complete representation. As we acquire 
more and more data, the hope is that the set of representations that are compatible with 
the data shrinks, although not necessarily to a singleton (a complete representation is 
not necessarily unique). In other words, we hope that the exploration process will re- 
duce the uncertainty on the representation. When and if the inferred representation is 
complete, it can synthesize the light-field of the original scene up to the uncertainty of 
the sensors, and we say we have performed sufficient exploration. Exploration can be 
more or less efficient, in the sense that sufficiency can be achieved with a varying cost 
of resources (e.g. time, energy). 

Exploration is the process that links maximal invariants, (j) A (I), whose complexity 
we called Actionable Information (AI), to minimal sufficient statistics of a complete 
representation, whose complexity we called complete information (CI). As we have 
noted, because of non-invertible nuisances, the gap between AI and CI can be filled 
by exercising some form of control on the sensing process. Such control could be 
exercised in data space, for instance by choosing the most informative features, or in 
sensor space, for instance by selecting sensing assets in a sensor network depending on 
visibility, or in physical space, for instance by moving around an occlusion or moving 
closer to an unresolved region of the scene. 

We start by designing a myopic explorer, driven simply by the maximization of 
Actionable Information. Ideally, by accumulating AI, one would hope to converge to 
the CI. Unfortunately, this is in general not the case for such a myopic explorer. We 
therefore consider a slightly less primitive explorer seeking to maximize its reward 
over a receding horizon. Again, in general, this strategy is not guaranteed to achieve 
complete efficient exploration. Thus, the modeling exercise points to the need to endow 
the explorer with memory that can summarize in finite complexity the results of the 
exploration so far. Such a memory will be precisely the representation we are after, 
and we discuss inference criteria that can drive the building of a representation. In 
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some simple instances, such criteria yield viable computational schemes. 

In all cases, occlusions and scaling/quantization play a critical role in exploration, 
a process that can be thought of as the inversion of such nuisances. Therefore, at 
the outset, we will need efficient methods to perform occlusion detection, to be used 
either instantaneously - by comparing two temporally adjacent images, as in myopic 
memoryless exploration - or incrementally, by comparing each new image with the 
one hallucinated from the current representation. 



8.1 Occlusion detection 
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Figure 8.1: Motion estimates for the Flower Garden sequence (left), residual e (center), 
and occluded region (right) (courtesy of [6]). 

In this section we summarize the results of [6], where it is shown that occlusion 
detection can be formulated as a variational optimization problem and relaxed to a 
convex optimization, that yields a globally optimal solution. We do not delve on the 
implementation of the solution, for which the reader is referred to [ ], but we describe 
the formalization of the problem. The reader uninterested in the specifics can skip the 
rest of this section and assume that the region Q(t) cDcl 2 that is not co-visible in 
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two adjacent images has been inferred. Note that this region is not necessarily simply 
connected or regular, as Figure 8.2 illustrates. 

Remark 11 (Optical flow and motion field) Optical flow refers to the deformation OPTICAL FLOW 

of the domain of an image that results from ego- or scene motion. It is, in general, 
different from the motion field, (2.9), that is the projection onto the image plane of 
the spatial velocity of the scene [200], unless three conditions are satisfied: Lamber- 
tian reflection, constant illumination, and co -visibility. We have already adopted the 
Lambertian assumption in Section 2.2, and it is true that most surfaces with benign 
reflectance properties (diffuse/ specular) can be approximated as Lambertian almost 
everywhere under diffuse or sparse illuminants (e.g. the sun). In any case, widespread 
violation of Lambertian reflection does not enable correspondence [ ], so we will 
embrace it like the majority of existing approaches to motion estimation. Similarly, 
constant illumination is a reasonable assumption for ego-motion (the scene is not mov- 
ing relative to the light source), and even for objects moving (slowly) relative to the 
light source. Co-visibility is the third and crucial assumption that has to do with occlu- 
sions and is neglected by the majority of approaches to infer optical flow. If an image 
contains portions of its domain that are not visible in another image, these can patently 
not be mapped onto it by optical flow vectors. Constant visibility is often assumed be- 
cause optical flow is defined in the limit where two images are sampled infinite simally 
close in time, in which case there are no occluded regions, and one can focus solely on 
discontinuities of the motion field. But the problem is not that optical flow is discontin- 
uous in the occluded regions; it is simply not defined; it does not exist. By definition, an 
occluded region cannot be explained by a displaced portion of a different image, since 
it is not visible in the latter. Motion in occluded regions can be hallucinated ( extrapo- 
lated, or "inpainted") but not validated on data. Thus, the great majority of variational 
motion estimation approaches provide an estimate of a dense flow field, defined at each 
location on the image domain, including occluded regions. In their defense, it can be 
argued that even if we do not take the limit, for small parallax ( slow-enough motion, 
or far-enough objects, or fast-enough temporal sampling) occluded areas are small. 
However, small does not mean unimportant, as occlusions are critical to perception 
[66] and a key for developing representations for recognition. 

Remark 12 (The Aperture problem) The aperture problem refers to the fact that the 
motion field at a point can only be determined in the direction of the gradient of the 
image at that point ( Figure 3. 7). Therefore, occlusions can only be determined if the 
gradient of the radiance intersects transversally the boundary of the occluded region. 
When this does not happen (for instance when a region with constant radiance occludes 
another one with constant radiance), an occlusion cannot be positively discerned from 
a material boundary or an illumination boundary. For instance, in the Flower Gar- 
den sequence (Figure 8.1), the (uniform) sky could be attached to the branches of the 
tree, or it could be occluded by them. Often, priors or regularizes are imposed to re- 
solve this ambiguity, but again such c choice is not validated from the data. We prefer 
to maintain the ambiguities in the representation, deferring the decision to the explo- 
ration process, when data becomes available to disambiguate the solution (e.g. the tree 
passing in front of a cloud). This naturally occurs over time as the initial representation 
converges towards the complete representation. 
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LAGRANGE MULTIPLIER 



For clarity, we denote a time- varying image via / : D C R 2 x R + — >> R + ; (x, t) H> 
I(x, t). Under the Lambertian assumption, as we have seen in Section 5.1, the relation 
between two consecutive images in a video {I(x, t)}f =0 is given by 

I(x, t + dt) = J /( ^' ^ + n(x ' xe D \ Q ^ dt) (8 1) 

1 v(x, t\ x G ft(t\ dt) 

where w : D x R + — >• R 2 is the optical flow 1 (2.10) which approximates the motion 
field (2.9) and is defined everywhere except at occluded regions ft. These can change 
over time and depend on the temporal sampling interval dt; ft is in general not simply- 
connected, so even if we call ft the occluded region (singular), it is understood that it is 
made of multiple connected components. 2 In the occluded region, the image can take 
any value v : ft x R + — >> R + that is in general unrelated to I(x,i)\ n . Because of 
(almost-every where) continuity of the scene and its motion (i), and because the additive 
term n(x,t) compounds the effects of a large number of independent phenomena and 
therefore we can invoke the Law of Large Numbers (ii), in general we have that 

(i) lim ft(t; dt) = 0, and (ii) n H ~ A/"(0, A) (8.2) 

dt— >0 

i.e. , the additive uncertainty is normally distributed in space and time with an isotropic 
and small variance A > 0. Note that we are using the (overloaded 3 ) symbol A for the 
covariance of the noise, whereas usually we have employed the symbol a. The reason 
for this choice will become clear in Equation (8.11), where A will be interpreted as a 
Lagrange multiplier. We define the residual 

e : D R + ; (x, t) H> e(x, t; dt) = I(x, t + dt) - I(w(x, t),t) 

on the entire image domain x G D, via (for simplicity we omit the arguments of the 
functions when clear from the context) 

e(x,t;dt)=I(x,t + dt)-I(w(x,t),t) = h X ' t } ^f^" _ (8.3) 

I v(x,t) — l(w(x,t),t) x E it 

which we can write as the sum of two terms, e\ : D — >> R + and e2 : D — >• R + , also 
defined on the entire domain D in such a way that 

feiO,t;dt) = v(x,t) - I(w(x,t),t), x G ft 
1 e2(x, t\ dt) = n(x, t), x G D\ft. 



Sometimes the deformation field w is represented in the form w(x) = x + v(x) and the term "optical 
flow" refers to v(x). The two representations are equivalent, for one can obtain w from v as above, and v 
from w via v(x) = w(x) — x. Therefore, we will not make a distinction between the two here, and simply 
call w the optical flow. 

2 Multiple connected component means that each component is a compact simply connected set, that is 
disjoint from other components, it does not mean that the individual components of the occluded domain fl 
are connected to each other. 

3 We had already used A to indicate the loss function in Section 2.1, but that should cause no confusion 
here. 
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Note that C2 is undefined in Q, and e\ is undefined in D\Q, in the sense that they can 
take any value there, including zero, which we will assume henceforth. We can then 
write, for any xGD, 

/(x, t + dt) = I(w(x, t),t) + t; (it) + e2(x, t; dt) (8.5) 

and notice that, because of (i), e\ is large but sparse, 4 while because of (ii) 62 is small 
but dense 5 . We will use this as an inference criterion for w, seeking to optimize a data 
fidelity criterion that minimizes the L° norm of ei (a proxy of the area of Q), and the 
log-likelihood of n (the Mahalanobis norm relative to the variance parameter A) 

^data(w,ei) = ||ei|| L o (jD) + -||e 2 ||L2(D) subject to (8.5) (8.6) 
-\\I(x,t + dt) -I(w(x,t),t) -ei|| L 2 (D) + ||ei|| L i( D) 



where 
and II ; 



A 

\l°(d) 



= / Xf(x)^o(x)dx is relaxed as usual to ||/||li(d) = f D \f(x)\dx 
l 2 {d) = J D \f(x)\ 2 dx. Since we wish to make the cost of an occlusion in- 
dependent of the brightness of the (un-)occluded region, we can replace the L 1 norm 
with a re-weighted i 1 norm that provides a better approximation of the L° norm [7]. 

Unfortunately, we do not know anything about e\ other than the fact that it is sparse, 
and that what we are looking for is xn ei, where x : D — > M + is the characteristic 
function that is non-zero when x G ft, i.e. , where the occlusion residual is non-zero. 
So, the data fidelity term depends on w but also on the characteristic function of the 
occlusion domain e\. Using the formal (differential) notation for first-order differences 
(note that this is just a symbolic notation, as we do not let dt — >• 0, lest we would have 

n -> 0) 



VI(x,t) 



I (x + 


' 1 " 






-I{x,t) 


l(x + 


" ' 

1 




-I{x,t) 



I t (x,t) = I(x,t + dt) — I(x,t) 
v{x, t) = w{x, t) — X 
we can approximate, for any x G D\ft, 

I(x, t + dt) = I(x, t) + V/(x, t)v(x, t) + n(x, t) 

where the linearization error has been incorporated into the uncertainty term n(x,t). 
Therefore, following the same previous steps, we have 



(8.7) 

(8.8) 
(8.9) 

(8.10) 



^data(^ei) 



\VIv + I t - e^dx + \ / \ei\dx 

D JD 



(8.11) 



4 In the limit dt — > 0, "sparse" stands for almost everywhere zero; in practice, for a finite temporal 
sampling dt > 0, this means that the area of fl is small relative to D. Similarly, "dense" stands for almost 
everywhere non-zero. 

5 In the limit dt — > 0, "sparse" stands for almost everywhere zero; in practice, for a finite temporal 
sampling dt > 0, this means that the area of fl is small relative to D. Similarly, "dense" stands for almost 
everywhere non-zero. 
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Since we typically do not know the variance A of the process n, we will treat it as 
a tuning parameter, and because ^data or A^data yield the same minimizer, we have 
attributed the multiplier A to the second term. In addition to the data term, because the 
unknown v is infinite-dimensional, we need to impose regularization, for instance by 
requiring that the total variation (TV) of v be small 



^reg(^) =fi I \\Vv\\dx (8.12) 
JD 

where /i is a multiplier to weigh the strength of the regularizer. TV is desirable in the 
context of occlusion detection because it does not excessively penalize motion discon- 
tinuities, that can occur where motion is parallel to the occluding boundary (so there is 
an occluding boundary without an occluded region, for instance where the direction of 
motion is parallel to the occluding boundary, see Remark 12). The overall problem can 
then be written as the minimization of the cost functional = ^data + ^reg> which in 
a discretized domain (the lattice DflZ 2 ) becomes 



«, ei = argmin \\VIv + I t - e\\% + A||e|| £ i + Vv\\ £ i (8.13) 

v,e s ™ / 

v 

i/j(v,e) 



where £j, w is the re-weighted i 1 norm [7] and i 1 , £ 2 are the finite-dimensional version 
of the functional norms L 1 (D) , L 2 (D) . The remarkable thing about this minimization 
problem is that ip is convex in the unknowns v and e. Therefore, convergence to a global 
solution can be guaranteed for an iterative algorithm regardless of initial conditions 
[27]. This follows immediately after noticing that the £ 2 term 



[V/, -Id] 



h y (8.14) 



is linear in both v and e, the gradient (first-difference) Vv is a linear operator, and the 
i 1 norm is a convex function. 

Not only is this result appealing on analytical grounds, but is also enables the de- 
ployment of a vast arsenal of efficient optimization schemes, as proposed in [ ]. In 
practice, although the global minimizer does not depend on the initial conditions, it 
does depend on the multipliers A and /i, for which no "right choice" exists, so an em- 
pirical evaluation of the scheme (8.13) is necessary [6]. 

It is tempting to impose some type of regularization on the occluded region Q (for 
instance that its boundary be "short"), but we have refrained from doing so, for good 
reasons. In the presence of smooth motions, the occluded region is guaranteed to be 
small, but under no realistic circumstances is it guaranteed to be simple. For instance, 
walking in front of a barren tree in the winter, the occluded region is a very complex, 
multiply-connected, highly irregular region that would be very poorly approximated by 
some compact regularized blob (Figure 8.2). 

Source code to perform simultaneous motion estimation and occlusion detection is 
available on-line from [6]. It can be used as a seed to detect detached objects, or better 
detachable objects [8], or to temporally integrate occlusion information so as to infer 
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Figure 8.2: Occlusion regions for natural scenes (left) can have rather complex struc- 
ture (middle) that would be destroyed if one were to use generic regularizers for the oc- 
cluded region The disparity maps (right) should only be evaluated in the co-visible 
regions (black regions in the middle figure), since disparity cannot be ascertained in 
the occluded region. 

depth layers as done in [87, 86]. We will, however, defer the temporal integration of 
occlusion information to the next sections, where we discuss the exploration process. 
For now, what matters is that it is possible, given either two adjacent images, I tj It+i, 
or an image predicted via hallucination, for instance and a real image It+i, to 
determine both the deformation taking one image onto the other in the co-visible region, 
and to determine the complement of such a region, that is the occluded domain. 

In the next section we begin discussing the implication of occlusion detection for 
exploration. 

8.2 Myopic exploration 

Consider an image I t , taken at time t, and imagine using it to predict the next image, 
It+i- In the next section we describe this process in the absence of non-invertible 
nuisances. We show that in this case one image is sufficient to determine a complete 
representation, and there is no need for exploration at all. The story is, of course, 
different when there are invertible nuisances. 

8.2.1 In the absence of non-invertible nuisances 

In the absence of occlusions (for instance, if the images are taken by an omni-directional 
camera inside an empty room) and quantization errors or noise (if the images are sam- 
ples of the scene radiance that satisfy Nyquist's condition - if that was possible, that 
is), and if the scene obeys the Lambert- Ambient- Static model (2.6), the new image 
provides no new "information," in the sense that it can be generated by the old image 
via a suitable contrast transformation of its range and diffeomorphic deformation of its 
domain. In other words, the first image I t can be used to construct a representation as 
we have discussed in Section 2.4.4, which in turn can be used to hallucinate any other 
image of the same scene, for instance Jt+i. In fact, from (2.6), we can solve for p in 
the first image via p(p(x)) = It(x), where p(x) = x/\\x\\ and substitute in the second, 
so we have 

I t +i(w(x)) = koI t (x) (8.15) 
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Figure 8.3: Detachable object detection (courtesy of [8]). The "girl" in Figure 1.2 
and the soccer player would be similarly classified as bona-fide "objects" by a passive 
classification/detection/segmentation scheme. However, an extended temporal obser- 
vation during either ego-motion (top) or object motion (bottom) correctly classifies the 
car on top and the soccer player on the bottom as "detached objects" but not the girl 
painted on the road pavement (middle). 



where w(x) = 7r^7r _1 (x/||x||) and where we have aggregated the two contrast trans- 
formations k t +dt and k t into one. This is exactly the hallucination process described 
in Section 3.1, whereby a single image I t , supported on a plane, is interpreted as a 
representation and used to hallucinate the next image It+i> 

In the absence of non-invertible nuisances, the only purpose of the second image is 
to enable the estimation of the domain deformation w and the contrast transformation 
k through equation (8.15) [90]. So, the first image captures the radiance p, and in com- 
bination with the second image they capture the shape S, entangled with the nuisance 
g in the domain deformation w, as discussed in Section 6.2 and further elaborated in 
B.3. Any additional image I r , not necessarily taken at an adjacent time instant, can 
be explained by these two images, and does not provide any additional "information" 
on the scene. In fact, any additional image can be used to test the hypothesis that it is 
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generated by the same scene, by performing the same exercise with either of the two 
(training) images, to test whether they are compatible with (i.e. can be generated by) 
the same scene, except for the residual error n that is temporally and spatially white, 
and isotropic (homoscedastic). 6 This test consists of an extremization procedure (Sec- 
tion 2.3 and Figure 2.5), whereby nuisance factors, as well as the scene, are inferred 
as part of the decision process, and the residual has high probability under the noise 
model. This can be thought of as a "recognition by (implicit) reconstruction" process, 
and was common in the early days of visual recognition via template matching. 

Notice that, as discussed in Section 6.2, contrast (a nuisance) interacts with the radi- 
ance, so one can only determine the radiance up to a contrast transformation. Similarly, 
motion (a nuisance) interacts with the shape/deformation, so one can only determine 
shape up to a diffeomorphism. If one were to canonize both contrast and domain dif- 
feomorphisms, then some discriminative power would be lost as any two scenes that 
are equivalent (modulo a domain diffeomorphism, whether or not they are generated 
by a scene with the same shape) would be indistinguishable from a maximal invariant 
feature (or from any invariant feature for that matter) [197]. 

In summary, in the absence of non-invertible nuisances, a single (omni-directional, 
infinite-resolution, noise-free) image can be used to construct a representation from 
which the entire light field of the scene can be hallucinated. This is what we called a 
complete representation. Its minimal sufficient statistic is the Complete Information, 
and there is no need for exploration. The need for exploration does, of course, arise 
as soon as non-invertible nuisances are present. The case discussed above is a wild 
idealization since, even in the absence of occlusions, quantization and noise introduce 
uncertainty in the process, and therefore there is always a benefit from acquiring addi- 
tional data, even in the absence of any active control. 

8.2.2 Dealing with non-invertible nuisances 

In general, equation (8.15) is valid only in the co-visible domain (Section 2.2). Ne- 
glecting contrast transformations for simplicity, we have that 

I t+1 (w(x)) = I t (x), x G DC) w~ 1 (w(D)) c D. (8.16) 

The complement of the co-visible domain is the occluded domain 

n = D\w-\w(D)) C D. (8.17) 

Since we will assume that there is a temporal ordering (causality), we will consider 
only forward occlusions (sometimes called "un-occlusion" or "discovery"); that is, 
portions of the domain of lt+i> ^ C D, whose pre-image w~ l (Vt) was not visible in 
I t . The restriction of the image to this subset, {I t +±(x), x G Q} cannot be explained 
using I t . 

Consider now an image I t , and imagine using it to predict the next image It+i, as 
we have done in the previous section. No matter how we choose w, we cannot predict 



6 This kind of geometric validation is used routinely to test sparse features for compatibility with an 
underlying epipolar geometry [125]. 
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exactly what the new image is going to look like in the occluded domain Q, even if 
we had an omni-directional, infinite-resolution, noise-free camera. Therefore, in the 
occluded domain we have a distribution of possible "next images," based on the priors. 
The entropy of this distribution measures the uncertainty in the next image, and there- 
fore the potential "information gain" from measuring it (before we actually measure 
it). Once we discount other nuisance factors, by considering a maximal invariant of the 
next image in the occluded domain, we have the innovation 

e(/,t + l)=</> A (/ t+1| J (8.18) 

whose complexity represents the Actionable Information Increment (AIN): 

AIN(I,t) = H{e{I,t + 1)) = W(/t+i| n ). (8.19) 

The AIN is the uncertainty in the innovation, or the "degree of unpredictability" of 
future data. The above is the contribution to the AIN due to occlusions, that can be de- 
termined as in Section 8.1. However, we also have uncertainty due to quantization and 
noise. 7 This is simpler to model, as it is usually independent of the scene: Uncertainty 
due to scaling and quantization increases linearly with distance, and uncertainty due 
to noise is independent of both the scene and the viewer. In all cases, however, there 
exists a control action that can reduce uncertainty. For the case of scale, it is zooming, 
or translating along the optical axis, or increasing the sensor's resolution. For the case 
of noise, it is taking repeated (registered) measurements. 

In all cases, what yields a non-zero innovation is the presence of non-invertible 
nuisances, v t , that include occlusion, quantization, and other unmodeled uncertainty 
(noise). We then have that the AIN can be computed as the conditional entropy of 

p(^(I t+1 )\I t ) = J p{^{I t+1 )\I u v t )dP{v t ) (8.20) 

where we have marginalized over all non-invertible nuisances. We therefore have, for 
the specific case of just two images, 

AIN{I,t)=H{I t+1 \I t ). (8.21) 

This construction can be extended to the case where we have measured multiple images 
up to time t, as we discuss in the next section. 

Remark 13 (Background subtraction) The innovation can be thought of as a form 
of generalized background subtraction. In the most trivial case, the representation is 
derived from one image only (the background image) and the only motion in the image 
is due to a 'foreground object" Other background subtraction schemes, including 
those based on multiple layers or on an aggregate model of the background [106], can 
be understood in the framework of occlusion detection and innovation under different 
prediction models [87, 86]. 

7 Quantization can also be thought of as a form of occlusion mechanism, since details of the radiance that 
exist at a scale smaller than the back-projection of the area of a pixel onto the scene are "hidden" from the 
measurement. Rather than moving around an occluder, one can "undo" quantization by moving closer to the 
scene. 
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8.2.3 Enter the control 

Clearly, the AIN depends on the motion between t and t + dt, as well as on the 
(unknown) scene £. For a sufficiently small dt, gt+dt — exp (udt)g t , where u G se(3) 
is the rigid body velocity of the viewer [125] (assumed constant between t and t + dt), 
the operator ^ is the "hat" operator that maps a 6-dimensional vector of rotational 
and translational velocities into a "twist" [125] and se(3) denotes the Lie algebra of 
"twists" associated with SE(3) [125]. Thus, the motion between t and t + dt is given 
by gt+dtgt 1 — Id+udt, which one can control by moving the sensor with a velocity u. 
By assuming velocity to be sufficiently small, we can take dt to be the unit time interval, 
dt = 1. Therefore, the AIN can, to a certain extent, be controlled. When emphasizing 
the dependency on the control input, we write AIN(I,t;u) = T-L(It+i\It,u t ). A 
myopic controller would simply try, at each instant, to perform a control action (e.g. a 
rigid motion, zooming, or capturing and averaging multiple images) so as to maximize 
the AIN: 

u t = arg max AIN (I, t; u) . (8.22) 

u 

Note that the AIN is computed before the control u t is applied, and therefore before the 
"next image" I t +i is computed. Thus the AIN involves a distribution of next images, 
marginalized with respect to all non-invertible nuisances. Once we actually measure 
the next image, we sample an instance of the innovation process. If we have discovered 
nothing in the next image, the sample of the innovation will be zero. 

Therefore, a myopic agent could easily be stuck in a situation where none of the 
allowable control actions Ut EU yield any information gain. The observer would then 
stop despite having failed to attain the Complete Information. The observer could also 
be trapped in a loop where it keeps discovering portions of the scene it has seen before, 
albeit not in the immediate past, for instance as it moves around a column. This is 
because the controller (8.22) does not have memory. 

To endow the controller with memory we can simply consider the innovation rel- 
ative to the history I 1 = Jq = {/ r }^ =0 , rather than relative to just the current image 
It. 

u t = argmaxH(/t+i|/ t , u) (8.23) 

u 

There are two problems with this approach: One is that it is still myopic. The other is 
that the history P grows unbounded. 

8.2.4 Receding-horizon explorer 

The myopic approach in the previous section is essentially performing a greedy search, 
a process that yields no provable guarantees unless the underlying functional being 
maximized is convex (or, in the discrete case, submodular [ ]). A slight improve- 
ment can be had by planning a controller that maximizes the AIN over a finite horizon, 
for instance of length T > 1 : 

u t = argmax / H(4+ T |J t ,ii). (8.24) 

u 

Of course, as soon as the agent makes one step, and acquires It+i, the history changes 
to I t+1 ; therefore, the agent can use the control planned for the T-long interval to 
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perform the first step, and then re-plan the control for the new horizon from t + 1 to 
t + T + 1. One can also consider an infinite horizon T — oo, which presents obvious 
computational challenges. Regardless of the horizon, however, all controllers above 
have to cope with the fact that the history keeps growing. Even setting aside the fact 
that no provable guarantees of sufficient exploration exist for this strategy, it is clear that 
the agent needs to have a finite-complexity memory that summarizes all prior history. 



8.2.5 Memory and representation 

A memory is a statistic of past data, 0(7* ). Of all statistics, we are interested in those 
that are parsimonious (minimal), and "informative" in the sense that they can generate 
copies not of the data, but of the maximal invariants of the data. These are the char- 
acteristics of what we have defined as a representation. At any given time t, we can 
infer a representation £ £ that is compatible with the history P and that, eventually, may 
converge to a complete representation £. Building such a representation (memory) is 
an inference problem that can be framed in the context of exploration as a System Iden- 
tification problem under a suitable prediction-error criterion [121]. Given a collection 
of data P, we are interested in determining a "model" <p such that the statistic (j)(P) 
best summarizes the entire past P for the purpose of predicting the entire future 
In Remark 14 we will shows that this is equivalent to determining 

it = axgmin subject to P e £(6). (8.25) 

In prediction-error methods, usually the complexity constraint is enforced by choosing 
the bound on the order of the model (the number of free parameters), and the entropy is 
approximated with its empirical estimate, assuming stationarity. For the case where all 
the densities at play are Gaussian, minimizing the entropy above reduces to minimizing 
the sum of squared one-step prediction errors. This problem has also been addressed in 
the literature of information-based filtering [137], but can only be realistically solved 
for low-dimensional state-spaces [161] using particle filtering, or for linear-Gaussian 
models (although see [75]). 

Note that the condition P G C(£t) is t0 be understood up to the noise in the mea- 
surement device n t , according to (2.18), in the sense that the residual I t — h(g, £ t , v) 
has high probability under the noise model p n for all times t. Such a "noise" lumps the 
residual due to all unmodeled phenomena and can be model as spatially and temporally 
white, and isotropic (homoscedastic). If not, the spatial or temporal structure can be 
included as part of the representation, until the residual is indeed white. The above 
equation can then be interpreted as a reduction of uncertainty, specified by the mutual 
information P), as we have anticipated in Section 2.5. 

In a causal exploration process, where data is gathered incrementally, we would like 
to update the estimate of the representation; by minimizing the conditional entropy 
above. At the same time, we would like to control the exploration process so as to 
maximize the AIN. This is discussed in the next section where we describe dynamic 
explorers. 
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8.3 Dynamic explorers 

In this section we consider an entity including a sensor (a video camera) capable of 
exercising a control, for instance by changing the vantage point or the characteristics 
of the sensor (zooming). We call such an entity an explorer. In the absence of any 
restriction on the control, so long as there are non-invertible nuisances, the AIN pro- 
vides guidance to perform exploration, and memory provides a way to incrementally 
build a representation £. Ideally, such a representation would eventually converge to a 
minimal sufficient statistic of the light field, £(£), thus accomplishing sufficient explo- 
ration. Sufficient exploration, if possible, would enable the absence of evidence to be 
taken as evidence of absence [153]. 

We start with one image, say Iq, and its maximal invariant (j) A (Io), that is com- 
patible with a (typically infinite) number of scenes, £o> that are distributed according 
to some prior p(£o) that has high entropy (uncertainty). Compatibility means that the 
image Io can be synthesized from f o up to the modeling residual n. As we have seen in 
Section 3.1, we can easily construct a sample scene by choosing So to be a unit sphere, 
Sq(x) = p(x) = p-n , and choosing po(p) = I(x) where p = p{x) in the entire sphere 

but for a set of measure zero. Therefore, we call fo = {po? ^o} = Clearly, if 

we had only one image, as we did in Section 3.1, we would not need a representation 
to explain it, and indeed we do not even need a notion of scene. So, in our case, this 
is just the initialization of the explorer, which corresponds to one of many possible 
representations that are compatible with the first image (see Figure 3.1). 

Given the initial distribution, p(£o) an d the next measurement, Ii, we can compute 
an innovation e\ = I\ — h(gi, fo> ^i), that is a stochastic process that depends on the 
(unknown) nuisances v\ as well as on the distribution of fo- We can then update 
such a distribution by requiring that it minimizes the uncertainty of the innovation. The 
uncertainty in the innovation is the same as the uncertainty of the next measurement, 
and therefore we can simply minimize the conditional entropy of the next datum, once 
we have marginalized or max-outed the nuisances: fi = argmin^ H(Ii\£o) where 

p(h\to) = / A/"(/i — h(g, £o> v))dP(g)dP(v). In practice, carrying around the entire 
distribution of representations is a tall order, for the space of shape and reflectance 
functions do not even admit a natural metric, let alone a probabilistic structure that 
is computable. So, one may be interested in a point- estimate, for instance one of the 
(multiple) modes of the distribution, the mean, median, or a set of samples. In any 
case, we indicate this via fi = argmm^^ H(/i|fo)- Now, instead of marginalizing 
over all (invertible and non-invertible) nuisances, we can canonize the invertible ones, 
and therefore, correspondingly, minimize the actionable information gap, instead of 
the conditional entropy of the raw data: At a generic time t, assuming we are given 
p(^t-i), we can perform an update of the representation via 



& = argmin%(/ £ |6-i) 



= arg min H 



J AT^ A (I t )-h(i t _ 1: iy)-a)dP(iy) 



(8.26) 

where o) denotes a Gaussian density with mean fi and isotropic standard devia- 
tion a. To the equation above we must add a complexity cost, for instance \H(£ t -i) 
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where A is a positive multiplier. 

Now, at time t, the updated representation £ t can be used to extrapolate the next 
image It+i. The hallucination process carries some uncertainty because of the non- 
invertible nuisances, and this uncertainty is precisely the information gain to come 
from the next image. Since the next image depends on where we have moved (or, more 
in general, on what control action we have exercised), we can choose the control u t so 
that the next image I t +\ will be most informative, i.e. 



ut = E£gmaxH(It+i\it,u) 

(8.27) 

where U includes complexity or energy costs associated with the control action u. 
Thus, inference and control are working together, one to maximize the uncertainty of 
the next data, the other to minimize it. 

Technically, the most informative next data would be the ones that produce the 
largest reduction in uncertainty, i.e. argmaxll(£; I t+l ) = H(£) — H(^\I t+1 ). Unfor- 
tunately, as we have already discussed, even defining a base measure in the space of 
scenes is difficult, so H(£) is problematic. However, we can rewrite the mutual infor- 
mation above in terms of the entropy of the data, H(I t +i u) — H(I t +i \£,u). Now, 
given the scene £, even the non-invertible nuisances v are invertible, so p(I t+ i\£, u) 
only depends on the residual uncertainty, that comes from all unmodeled phenomena 
and can realistically be considered white, independent, and isotropic (otherwise, if it 
has considerable structure, this should be modeled explicitly as a nuisance v). There- 
fore, the conditional entropy H(I t+ i\^ : u) is independent of u, and we can focus on 
the first term H(I t +i \P , u), which is what we have done in (8.27) after handling the 
invertible nuisance via pre-processing with <p A . 

The construction of a representation from a collection of data treated as a batch 
has been described in [87, 86] for the case of multiple occlusion layers portraying 
arbitrarily deforming objects, and in [90] for the case of rigid objects generating self- 
occlusions. This requires the dynamic update of the visible portion of the domain as 
a result of the update of the representation. In the case of [ 3], the representation 
consisted of an explicit model of the geometry of the scene (a collection of piece- 
wise smooth surfaces) as well as of the photometry of the scene, consisting of the 
radiant tensor field, that could be used to generate "super-resolution images" (i.e. im- 
ages hallucinated at a resolution higher than that of the native sensor that collected the 
original data), as well as to generate views from different vantage points despite non- 
Lambertian reflection. In all these cases, the uncertainty was assumed Gaussian, so 
minimum-entropy estimation reduces to wide- sense filtering (minimum- variance). 

The design of a control action, given the current representation, for the case of 
uncertainty due to visibility has been described in [192, 193] for compact spaces, and 
extended in [ ] for unbounded domains. The case where uncertainty due to scale and 
noise is also present has been described in [99]. 

Note that, in general, there is no guarantee that | t — » £ in any meaningful sense. 
The most we can hope for is that £(£t) — » £(£)> as we have pointed out. In other words, 
what we can hope is that our representation can at most generate data that is indistin- 
guishable from the real data, up to the characteristics of the sensor. Since the inference 



argmaxtf (j N(^(I t+1 ) - h(i u v)-(j)dP(v) 
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of a representation is guided by a control, and the hallucination process requires the 
simulation of the invertible nuisance g (which includes vantage point), Koenderink's 
famous characterization of images as "controlled hallucinations" is particularly fitting. 
Following the analysis above we can say that the representation £ is obtained through 
a controlled exploration (perception) process, and from the representation we can then 
hallucinate images in a controlled fashion. 

In general, however, it is not possible to guarantee that the exploration process will 
converge, even in the sense of (2.18) £(&) —> £(£)• However, it is trivial to design 
exhaustive control policies that, under suitably benign assumptions on the environment 
(that the scene is bounded, that the topology is trivial, and that the radiance obeys 
some sparsity or band-limited assumption) will achieve sufficient exploration, at least 
asymptotically: 

it^i s.t. £(£)=£(£)• (8.28) 

For instance, a Brownian motion restricted to the traversable space will, eventually, 
achieve complete exploration of a static environment. 

The goal of exploration is, therefore, to trade off the efficiency of the exploration 
process, including the cost of computing an approximation to the policy (8.26)-(8.27), 
with the probability of achieving sufficient exploration. This is beyond the scope of 
this manuscript and we refer the reader to the vast literature on Optimal Control, Path 
Planning, Robotic Exploration, and partially-observable Markov Decision Processes 
(POMDP) in Artificial Intelligence. For the case of uncertainty due to visibility (occlu- 
sions) both [79, 192] provide bounds on the expected path length as a function of the 
complexity of the environment. 

The discussion above suffices to our purpose of closing the circle on the issue of 
representation introduced in Section 2.4.4 and discussed in Section 3.1, by providing 
means to approximate it asymptotically from measured data. At any given instant of 
time, our representation £ t is incomplete, and any discrepancy between the observed 
images and the images hallucinated by the representation (the innovation) can be used 
to update the representation and reduce the uncertainty in £. 

The fact that, to infer a representation, a control u must be exercised, links the 
notion of representation (and therefore of information) inextricably to a control action. 
More precisely, the exploration process links the control to the representation, and 
actionable information to the complete information. This discussion, of course, only 
pertains to the limiting case where we have arbitrary control authority to move in free 
space, into every nook and cranny (to invert occlusions), to zoom-in or move arbitrarily 
close and have arbitrarily high sampling rate (image resolution, to invert quantization), 
and to stay arbitrarily long in front of a static scene (or to sample in time arbitrarily fast 
relative to the time constant of the temporal changes in the scene), to invert noise. The 
opposite extremum is when we have no control authority whatsoever, in which case all 
we can compute is a maximal invariant and its actionable information, as discussed in 
Section 2.5. In this case the Actionable Information Gap cannot be closed. 8 

In between, we can have scenarios whereby a limited control authority can afford 
us an increase in actionable information by aggregating the AIN over time, thus tak- 



8 In addition to mobility, another active sensing modality can be employed, for instance by controlling 
accommodation) or by flooding the space with a controlled signal and measuring the return, as in Radar. 
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ing us closer to the complete information. Therefore, we can think of the "degree of 
invertibility" of the nuisance, properly defined, as a proxy of recognition performance. 
This we do in Section 8.4. 

This chapter concludes the treatment of active exploration. In the next chapter we 
turn our attention to constructing models of not individual objects, but object classes 
with some intra-class variability. 

Remark 14 (Information Bottleneck) The functional optimization problem (8.26) is 
closely related to the Information Bottleneck principle [ ], that prescribes finding 
the representation £ that best trades off complexity and task fidelity by solving 

arg min 1(1*; £) - £I(£; £(£)). (8.29) 

p«|i*) 

Note that the last mutual information is not with respect to the "true " scene £, but to its 
lightfield £(£), which we can think of the collection of all possible images we can take 
from timet + 1 until t = oo, that is, I(£; £(£)) = I(£; Afote that the minimization 

is with respect to the (degenerate) density p^P) = p((/>(/ £ )|/ £ ) = 5(£ — </>(/*)), 
therefore with respect to the unknown "model" <j)(-), from which the "state 9 " £ t can be 
computed via £ t = 0(1*). Using these facts and the properties of Mutual Information 
we have that the above is equal to 

argmintf(0 - ff(£|J<) - pH^i) + Z^+ilO = (8-30) 
= arg min H(I^ ||) + Aif (£) (8.31) 

once we substitute A = \ j ft and recall that H(£\I l ) = since £ = (/>(/ £ ), and 11(1^) 
does not depend on £. Wz^ft ^/z^ underlying processes are stationary and Markovian, 
minimizing if(/^ 1 |£) w equivalent to minimizing H(I t +i\£). 

8.4 Controlled recognition bounds 

In traditional Communication Theory, given sufficient resources (bits), one can make 
the performance in the task (transmission of data) arbitrarily good in the limit, and for 
a given limit on the resources, one can quantify a bound on performance that does not 
depend on the particular signal being transmitted, but only on its distributional proper- 
ties [164]. It would be desirable to have a similar tool for the case of visual decision 
problems, whereby one could quantify performance in a visual decision (detection, 
localization, recognition, categorization etc.), rather than in a transmission task. The 
critical question is what represents the resource, that plays the role of the bit rate in 
communications? What do we need to have "enough of" in order to guarantee a given 

9 The "state" is the statistic the best summarizes the past for the purpose of predicting the future. In 
other words, it "separates" the past from the future. In the linear Gaussian case these sentences have precise 
meanings in terms of Markov splitting subspaces, whereby the state is the projection of the future onto 
the past ILqo* f° r stationary Markovian processes. 
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level of performance in a visual decision task? Enough pixels? Enough visibility? 
Enough views? Enough computing power? In this section, we will see that the criti- 
cal resource for visual decisions is the control authority the viewer has on the sensing 
process. We will argue that, for a passive observer with no control authority whatso- 
ever, there is no amount of pixels or computing power that will suffice to guarantee an 
arbitrarily high level of performance in a visual decision task (Section 8.4.1). In the 
opposite limiting case, we will show that an omnipotent explorer, capable of going ev- 
erywhere and staying there indefinitely, not only can guarantee a bound on the decision 
error, but can make this error arbitrarily small in the limit (Section 8.4.3). In between, 
we will attempt to quantify the amount of control authority and characterize its tradeoff 
with the decision error (Section 8.4.4). 

Some of the results in this chapter may appear obvious to some (of course we can- 
not recognize something we cannot see!), misleading to others, and confusing to others 
yet. Some examples are admittedly straw-men, meant to illustrate the importance of 
mobility for cognition, but we will try to state our assumptions as clearly and unequivo- 
cally as possible, and hopefully what matters will emerge, which is the fact that control 
plays a key role in perception, and that the notion of visual information, which is the 
topic of this manuscript, is the knot that ties them. Of course it is always possible to 
construct specific cases and counter-examples that violate the statements, but our point 
is that these statements are valid on average, once one considers all possible objects and 
all possible scenes. So, if we want to get visual decisions under control, in the sense of 
being able to provide guaranteed bounds on the decision performance, we have to put 
control in visual decisions, in the sense of being able to exercise some kind of control 
authority over the sensing process. 



8.4.1 Passive recognition bounds 

Let us consider again a simple binary decision of Section 2.1 as the prototype of a 
visual decision process. For instance, we may have one or more images of an object 
as a training set, and be asked whether a new test image portrays the same object. 
The average error we make in the decision is quantified by the risk functional, and 
we are interested in whether we can provide a bound, or a guarantee, in the risk, or 
equivalently in the average probability of error. We may have priors that allow us to 
answer this question even in the absence of any data. For instance, if we are interested 
in detecting humans in images from a database, and we are told that 90% of the images 
in the database have a human, then we can perform the task with an average probability 
of error equal to 0.1 just by always deciding that there is a human, c = 1, even without 
looking at the images. However, if by looking at the images we can do no better than 
an average probability of error of 10%, it means that the test images are "useless," they 
" contain no information/' and this would bring our decision strategy into question: 
How is the decision c(I) = 1 V I going to generalize to other objects or other datasets? 
It also raises the question of how the prior P(c — 1) = 0.1 came to be in the first place, 
if the images "contain no information." So, when we say that "the classifier performs 
at chance level" we mean that the classifier attains a risk that is equal to that afforded CHANCE LEVEL 
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by the prior. To simplify these issues, we can just assume that the two hypotheses 10 
c = 1 and c = — 1 are equiprobable, so that P(c = 1) = P(c = —1) = 1/2. 

Now, as we have said in the preamble to this manuscript, visual decisions are diffi- 
cult because of the variability introduced in the data by nuisance factors. We have also 
seen that nuisance factors that have the structure of a group and commute with other 
nuisances are simple to handle and they entail no loss in decision performance (The- 
orem 2). Therefore, to simplify the narration, we reduce the image formation model 
(Section 2.2) to the most essential components, which are scaling, translation, quan- 
tization and occlusion. Indeed, we will even restrict our world to a two-dimensional 
plane ("flatland") where objects are represented by one-dimensional surfaces that are 
piecewise planar that support a scalar- valued radiance function ("cartoons"). 

Remark 15 (Context) Even in the cartoon flatland, it is obvious that the worst-case 
performance in a decision task is at chance level. For instance, given any number of 
pixels, if the object is sufficiently far away, it will project onto an area smaller than 
a pixel, and therefore the image is uninformative. Again, this does not mean that one 
cannot make a decision, or that the decision cannot be made with a small probability of 
error using priors. One can exploit context (a particular form of prior) to infer the pres- 
ence of humans in a scene for instance by detecting a car on the road, and by knowing 
that with high probability a car on the road contains a human. However, the proba- 
bility of error is dictated by the probability of cars containing humans, which cannot 
be validated or verified from the data at hand. So, to render the decision meaningful 
one would have to impose limits, for instance an upper bound on the size of physical 
space being exhamined, and a lower bound on the number of pixels that the object of 
interest occupies. However, the size of physical space depends on the objects, some 
are smaller than others and therefore can only be detected up to a smaller distance. 
Already this exercise is getting dangerously self- referential: In our attempt to arrive at 
bounds in the probability of error in performing a certain task (say detect an object), 
we are imposing bounds that depend on the object itself (which objects we are trying 
to detect). Things get even worse when we consider occlusions. In fact, any object, no 
matter how close, cannot be recognized if it is completely occluded by another object, 
trivially. So, we would also have to impose limits on visibility, for instance by saying 
that a sufficiently large portion of the object has to be visible. But how large depends 
on the object: One can easily recognize a cat by seeing an eye, but one can probably 
not recognize a building by seeing even a large patch of white wall. What is even more 
problematic is the fact that we would have to impose restrictions on viewpoint, which 
is a nuisance. One can see half of an elephant and not be able to recognize it if that 
corresponds to a large grey patch. However, even a tiny portion of the elephant could 
enable recognition if it includes the trunk or tusks. Again, the limits depend on the 
individual objects, as well as on the combination of nuisances that generated the data, 
leading to a tautology whereby "we can recognize the elephant if we can recognize the 
elephant." What we are after, following the model of communication, is a bound that 
works on average for any object, seen under any nuisance. 

10 Here we use c = ±1 instead of c = {0, 1} for convenience since in the analysis we will employ the 
exponential loss instead of the — 1 loss used in Section 2.1. This entails no loss of generality since the 
classifier that minimizes the two risks is the same. 
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In this section, we argue that, for a passive visual observer, once we average the 
performance of even the best possible classifier over all possible objects, seen under 
all possible nuisances, without imposing object-dependent restrictions we have chance 
performance. In other words, even the best classifier performs at chance level (i.e. the 
classifier is, on average, as good as the priors). This has a number of consequences 
when one wishes to use the performance of a certain vision algorithm on a certain 
dataset as indicative of the performance of that algorithm under general conditions (and 
not just for similar objects, in similar poses, in similar scenes, with similar distances, 
under similar occlusion conditions). 



8.4.2 Scale-quantization 

In what follows, for simplicity, we adopt the exponential loss, instead of the symmetric 
— 1 loss. Also, under the cartoon flatland model, the image is just a scaled and 
sampled version of the radiance. 

Let /Gl denote measured data, c £ {±1} denote the class label, and p G H denote a hidden 
variable, which can be infinite-dimensional, in particular p £ {pi, p-i} denote the target class 
representative. Then we denote with c : Z — ^ {±1}; / H> c — c(I) sl classifier (a map from the 
data to the class label), which will be designed to minimize the expected risk. The expected risk 
is defined relative to a loss function A : {±1} x {±1} — »• M + ; (ci, 02) \-± A(ci, C2), which we 
will write with an abuse of notation (exploiting the fact that we have defined c as belonging to 
{±1}) as A(ciC2), as follows: 

R = E p [X(cc(I)] = J X(cc(I))dP(c,I) (8.32) 

where P(c, I) is the joint distribution of the label and the data. The latter is typically unknown 
and approximated empirically using a "training set" (samples {q, U}iLi ~ P(c, I)). In par- 
ticular we consider the exponential loss, defined via a scalar function X e (z) = exp(— z). This 
scalar function is applied to a discriminant, which is a function F : T — »• R defined in such a 
way that the classifier c can be written as c(I) = sign(F(/)). In particular, we have the logistic 
discriminant 

to which there corresponds the exponential risk 

R = E p [exp(-cF(I))] = J exp(-cF(I))dP(c,I). (8.34) 

Note that the classifier (discriminant) F* that minimizes the exponential risk also minimizes 
the — 1 risk, and therefore the two are considered equivalent. The latter has the advantage 
of being differentiable, leaving the discontinuity (sign function) to the last stage of processing 
(the derivation of the classifier from the discriminant). In either case, the computation of the 
classifier, and the corresponding optimal error rate 

R* = E p [exp(-cF* (I))] (8.35) 

rests on the computation of the posterior P(c\I). We will make again the simplifying assumption 
that the priors are equiprobable P(c — +1) = P(c — —1) = |, and therefore the posterior 
ratio (8.33) is identical to the (log-) likelihood ratio 

^)A^(m^)- (8-36) 



EXPONENTIAL LOSS 



2 b \ P {I\ 
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To compute the optimal error rate (8.35), we note that dP(I, c) — \p{I\c — +l)dl+ \p(I\c — 
— l)dl, which also implies that we can use the log-likelihood ratio (8.36) instead of the log- 
posterior. By substituting (8.36) into (8.35) we have 



(8.37) 



which can be evaluated once we substitute an expression for p(I\c) based on the cartoon flatland 
model. 

Ii = p * J\f(sU; (se) 2 ) + n % = j N{t - sU\ (se) 2 )p(t)dt + n % . (8.38) 

Note that the integral with respect to dl is on a finite-dimensional space and can be approximated 
in a Monte Carlo sense given a training set, for 



/ 



f(i)dP(i)~ J2 f(V- ( 8 - 39 > 

ii~dP(i) 



In our hypothesis testing, the presence of an object is the null hypothesis. The alternate hypothe- 
sis is the absence of the object, which corresponds to a distribution that is non-overlapping with 
(8.40) and close to uniform everywhere else (the "background"). 

dP(p\c =1)= U(p - m; a p ) dp(p) = xb^ (p - m)dp(p) (8.40) 

S v ' 

p(p|c=l) 

where xb is the characteristic function of the set B and dp(p) is a base measure (uniform) on 
the entire space X. n Because the space X is infinite-dimensional, a uniform measure would 
correspond to the base measure dp(p), which is improper in the sense that it does not integrate 
to one, as f dp(p) — p(X) = oo. An example of such a background density is 

dP(p\c = -1) = (1 - U(p - m; (J P ))dp{p) (8.41) 

which can be thought of as the limit when M — >> oo of the (proper) measure 

dP(p\c = -1) = c(l - U(p - m; <r p ))N(p\ M) dp(p) (8.42) 
v v ' 

p(p|c=-l) 

where c is a normalization constant. The non-overlapping requirement can be expressed in terms 
of the density functions p(p\c) as 



/ 



p(p\c = l)p(p\c = -l)dp(p) = 0. (8.43) 



We now show that the expected error is at chance level. To this end, we use the score, 
that is the derivative of the log-likelihood with respect to a parameter, which is used as 
a measure of the "sensitivity" of the probabilistic model with respect to the parameter. 

So if p(I\0) is the likelihood and 6 a parameter, then V(p, 0) = log p(I\0). Because of the 
Markov-chain dependency c — > p I, if / becomes increasingly insensitive from p, then it 
also becomes increasingly insensitive on c, and therefore the error becomes closer to 1; in the 
limit, we have 

R* = J yjp(l\c= l)p(I\c= -l)dl = J yjp(l\c) 2 dl = J p(l\c)dl = l. (8.44) 



11 There are some technicalities in the assumptions of absolute continuity of dP with respect to /i that we 
are sweeping under the rug. 
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Unfortunately, the "parameter" of interest in our case is = p, which is infinite-dimensional, 
that introduces a host of conceptual and technical difficulties. However, as we have done in 
Section 6.5, we will exploit the fact that natural radiances are "sparse," in the sense that a wide 
range of statistics exhibit highly kurtotic distributions. Accordingly, one can represent a given 
compact portion of image of a natural scene with a finitely generated model, indeed even a linear 
one 12 , with most of the parameters being non-zero: 

N 

p(t) = Y,h(t)a k =B(t)a teD (8.45) 

k=i 

where, for any given p, the coefficients a±, ct2, • • . are almost-all-zero, or equivalently \ctk\io is 
small. For any given over-complete basis (dictionary) {bk}k=i> and for any N, the distribution 
dP(p) can be written in terms of the distribution of a, dP(p) = p(p\a)dP(a). This adds a link 
to the Markov chain c — »• a —> p —> I. Therefore, it is sufficient to show that any dependency 
in the Markov chain is broken when the scale s — »• oo. By linearity, the convolution operator 
p * Af(sU] (se) 2 ) = B * Af(sti] (se) 2 )a = Ba is represented by multiplication by a (large) 
matrix B so that p(I\a) = Af(I — Ba; a 2 ), then we have that 

^ i (j\ \ d K n + ( ^2^1 ~ B *Af(sti](se) 2 )a 
— logp(Jp) = B*N(sti](se) ) ^ — = (8.46) 

= B^(sU;(se) 2 ) ! _ (B *N(sU; (se) 2 )) 2 q 

(J 2 (J 2 

Now, as s becomes large relative to e, the second term converges to the mean of p, and the first 
term to the mean of J, which is a sampled version of p (but the sampling procedure does not 
change the mean) with added noise (but the noise has zero mean). Therefore, as se — > oo, we 
have that V(p, p) — > 0, and therefore R* — > 1. We summarize this result in the following 
statement: 

Proposition 1 (Passive recognition bounds) The average error in a visual decision 
task for a passive observer is at chance level. 

We have not considered occlusions yet, but one can see that the result above would 
be true a fortiori. One could easily object that this result is a straw-man because we 
cannot see an infinitely large swath of space. We can then try to bound the space, 
and only allow s G [1, s max ) with s max < oo. In this case, one could put a bound 
in the average error for quantization and scaling, but one could still define the clutter 
density, as done in [ ], and show that for a sufficiently large clutter density again the 
recognition error goes back to chance. 

In the next section, we tackle the opposite limiting case, when the observer is om- 
nipotent. 



8.4.3 Active recognition bounds 

The next step is to test the error in an active sensing scenario, where one can control 
the scale parameter s (e.g. via a zoom lens) as well as translation t along the image 

12 Linear-sparse models are often used as a proxy for occlusions, where a certain region of the image is 
explained either by one basis (e.g. the foreground) or by another (e.g. the background). This corresponds to 
sparse coding with an 1° norm. This is usually relaxed to an t 1 optimization, under the understanding that, 
while images are not composed as linear combinations of foreground and background objects, they can be 
approximated by linear combinations of "few" bases (ideally just one). 
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plane. In this case, instead of marginalizing with respect to the distribution of unseen 
portions of a scene, one performs a max-out procedure by computing the minimum 
of the integrand with respect to s. In this case, the posterior can be made arbitrarily 
peaked and, therefore, the error can be made arbitrarily small. To see that, consider 
(8.38) and note that, for any ti, by choosing s —> 1 (i.e., by getting arbitrarily close to 
the object via a translation along the s-axis), and taking repeated measurements Ii(k), 
one can obtain an arbitrary number of independent measurements of p{U) and, by the 
assumptions made on the noise rti and the Law of Large Numbers, achieve an estimate 
of p(ti) that is arbitrarily accurate (asymptotically) p(U) — > p(U). Now, if one could 
not translate along the t-axis, this would be of little use, because knowing p at the 
(discrete) set of points t{ would not prevent infinitely many p, that coincide with the 
given class-conditional variable p on the samples U, but sampled from the alternate 
distribution, to fill an infinite volume. However, if we are allowed to translate by an 
arbitrary amount r, then for any t we can obtain an arbitrarily accurate estimate of 
p(t) = p(U + r) by sampling at r = t — U. 

In formulas, we start from the error R* = f Vp( i \ c = l)p(I\c= -l)dl but, instead of 
computing p(I\c) by marginalization as done in (2.12), we compute it by extremization, i.e. 

p{I\c) = J sup p(I\p,s)dP(p\c). (8.48) 

Since p(I\p, s) = Y\ i M{I% — p*J\f(sU; (se) 2 ); a 2 ), the maximum is obtained for s —> 1, when 
we have 

I % = p *A^(t 2 ;e 2 ). (8.49) 

Now, assuming that p is finitely generated, or adopting the assumption of sparsity, or the statistics 
of natural images, we have that for any neighborhood of size e of a given point U, there exists 
an integer TV* and a suitable (overcomplete) basis family {b k : Bt i (e) —> M} k=1 such that for 
every |r| < e 

N* 

P(U + r) = ^2b k {r)a k {U) = B(r)a h (8.50) 

k=l 

where the N* dimensional vector oti = a(U) is small in some norm. Therefore, the measured 
samples are given by 

N* r N* 

Ii = Y. N(U-T]e 2 )b k (T)dra k (U) = ^b k (0)a k (U) = B(0)a t (8.51) 

k=lJ / k=l 

v 

that shows that the coefficients on representing p with respect to the overcomplete basis {b k } 
are the same as those representing / with respect to the basis {b k } in a neighborhood of each 
sample point U. The original basis {b k } can be chosen in such a way that, in addition to being 
overcomplete, it makes the "blurred basis" {b k } overcomplete as well (there are some subtleties 
having to do with the coherence of the basis that we do not delve in here [122]). 

This shows that p can be sampled around any U by simply computing the (sparse) represen- 
tation of Ii, ai and using it to encode p with respect to the basis B, so that p(t) — B(U + t)clu 
with r = t — ti. When the neighborhood size e is too large relative to the band of p, or to 
its sparsity index, additional sampling can be performed by controlling translation along the t 
axis. In fact, because we are free to translate by an arbitrary (continuous) amount r, we can 
obtain multiple samples of / around any U by choosing 5r < such that t = U + kSr, with 
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k = 1, . . . , K. Therefore, by purposefully choosing s, 6r, we can obtain on and therefore p(t) 
at any value of t. If we consider I (St) to be a (sub)-sampling of I(t) at intervals of 5r, 

I(5t) = {I ti+ k6r}f,k=i (8.52) 

then we have that 

sup p(I(6r) I p,s) ->6(I-p) (8.53) 

s,8t 

and therefore the error rate becomes, asymptotically, 

R*(m,* p ) = f (Jf supp(I(Sr)\ Pl s)dP(p\c = 1) ^ J sup p(I(Sr)\ Pl s)dP(p\c = -L)j dl 
= f Vp(p\c=l)p(p\c=-l)dp(p) < (8. 

(8. 



]J J P(p\c = l)p(p\c = -l)dp(p) = 



following (8.43) and Jensen's inequality. Since the densities p(p\c) are positive, we conclude 
that the error can be made arbitrarily small for any m and a p . The above constitutes a sketch 
of a proof of the following statement. 

Proposition 2 (Active recognition for scale-quantization) The error bound in a vi- 
sual decision problem subject to scaling and quantization nuisances can be made ar- 
bitrarily small by an omnipotent observer. 

The sketch above applies to the case where there are no occlusions. We tackle this case 
next. 

Let D\Q be the visible portion of the domain of p(t). In our case, this is just an interval 
of the real line, D\Q = [a, b]. The data generation model for a sample point U is, again, the 
average in the quantization area [U — e, U + e] of the scene, which is p(t) in D, and something 
else outside, call it /3(t), the "background." Thus we have 

Ii = p(st)dt+ I P(t)dt + m (8.56) 

JDDBeiti) J D c nB e {ti) 



/»min(6,ti+e) />max(a,t^— e) rc 

j p(st)dt+ / P(t)dt+ / 



(3(t)dt + m 



and consequently 



fTl h v ^(l t -p^^(sU;(se) 2 y,a 2 ) U e [a + 6, b - e] 
p(Ii\a,b,p,s) = < - v (8.57) 
I ft otherwise. 

where ft is a white process. In other words, only the samples Ii captured at U such that the 
entire ball of radius e centered at U is contained in D are "pure." All the other samples that are 
either outside or whose quantization region intersects the background are unpredictable and are 
therefore considered a white process. This is an approximation, since in practice samples whose 
quantization region intersects the boundary are not independent of p. However, because we do 
not know anything about ft and the samples are non-overlapping (the pixel arrays determines a 
partition of the image plane, or e = \U+i — U\), we can safely assume that pixels covering a 
mixture of the foreground and the background carry no information about the former. 
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The joint distribution of the measurements thus factorizes, and is calculated in detail in [99]. 
In general, the distribution dP(b) will be a scaled family depending on a parameter, the "clutter 
density", that is also a function of the maximum scale s m ax [138]. This, in turn, is a function 
of the number of pixels, since we can only resolve objects at best up to a scale beyond which 
they project onto an area smaller than a pixel. In general, it is not possible to control all these 
variables. Instead, for any given pixel size (this is the variable we can control) there will be a 
sufficiently large space (s max ), and a sufficiently high clutter density so that the sensitivity of the 
likelihood on p, the "score", ^ log ^ J l p ) ? i s arbitrarily small, and therefore once averaged against 
the density dP(b) the expected error is arbitrarily large. In other words, for any arbitrarily 
large number K, and any arbitrarily fine pixel sampling e, one can find a sufficiently 
large depth range s max = s max (e) and a sufficiently high clutter density bo so that if 
dP(b) = £(b;bo)db, the error will be higher than K. 

Vice-versa, if one is allowed to control (s, t), then these can be chosen so that the 
entire domain [a, b] is sampled (by controlling s) at an arbitrarily fine spatial sampling 
(by controlling t). Therefore, in an active setting the error rate can be made arbitrarily 
small. We summarize this argument, which is spelled out in more detail in [ ], in the 
following statement. 

Proposition 3 (Active recognition bounds) An omnipotent observer can make the av- 
erage error in a visual decision task arbitrarily small. 

8.4.4 Control-Recognition tradeoff 

The previous section established that for a completely passive observer, the expected 
error in a visual decision problem is at chance level, and that for an omnipotent explorer, 
the error can be made arbitrarily small. The goal in this section is to quantify the 
tradeoff between control authority and performance in a vision-based decision task. 
For simplicity, we refer to the generic visual decision task as "recognition," so what we 
are trying to quantify is the trade-off between Control and Recognition. 

What makes the error arbitrarily small for the omnipotent explorer is its ability to 
invert the nuisances. So, to establish the trade-off we need to establish a relation be- 
tween the "degree of invertibility" and the control. This comes from the data formation 
model. As usual, we assume the Lambert-Ambient-Static model (2.6). We neglect 
contrast transformations (or equivalently assume that they have been canonized as de- 
scribed in Section 6). Also, for the moment, we neglect quantization, which can be 
inverted by the control action of zooming in. This is straightforward when neglecting 
optical artifacts such as diffraction. The data formation model then reduces to 



where p : R 2 -> S C M 3 , p : S -> JR + . This is just the usual Lambert- Ambient model (2.6) 
with time explicitly indicated. Here p — means that the radiance changes slowly relative to the 
sampling frequency, and similarly p = means that the surface S deforms slowly relative to the 




< 



p = 
9 = ug 

It(x t ) = p(p) y P eg^ 1 7r- 1 (D)cS 
^x t = irgtp, x t e Dn irg t (S) 



(8.58) 
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sampling frequency. Note that the map 7r, and its inverse 7r _1 , entail self-occlusions. Invertibility 

of the nuisance relates to the observability of the state of the dynamical model above. Since p, p observability 

can be thought of as (infinite-dimensional) parameters of the model (and be represented as states 

with trivial dynamics), observability of the states is related to the identifiability of the parameters, 

which in turns depends on the properties of the input u G W, specifically on it being sufficiently 

exciting. So, we will be studying the identifiability/observability of the model above, and the 

sufficient excitation properties of the input. 

In terms of control authority, this can be measured by properties of the controllability dis- 
tribution [84]. If we write the model above in the standard form x = f (x) + g(x)?z, where CONTROLLABILITY DISTRIBUTION 
x — {piPi 9}i we see that f = (i.e, the model is driftless), and g is trivial, because the scene 
components p, p have dynamics that are (trivially) decoupled from the nuisance g (the two are, 
of course, coupled in the measurement equation). Therefore, the properties of the controllability 
distribution 13 boil down to the properties of the input delay line 

C = {u,u,u,...}. (8.59) 



In particular we will be interested in the volume of this set, which we call the control au- 
thority. A controller capable of spanning more space, or spanning it faster, will have larger 
volume. Of course, if more complex constraints were imposed on the dynamics, for instance 
non-holonomic constraints if the sensor was mounted on a wheeled vehicle with steering and 
throttle controls, there would be a non-trivial drift, and the full controllability distribution (see 
Footnote 13) would have to be taken into account. 

We now argue that the model (8.58) is observable in the absence of occlusions. 
Observable in this context means that the state can be inferred uniquely from the 
output (measurements), modulo an arbitrary choice of similarity reference frame (a 
global translation, rotation and scale). This follows from the assumptions underlying 
the model (8.58) and Theorem 3 in [174]. As a corollary, we can say that p(-) and p(-) 
are observable in one step (with one level of differentiation) in the co- visible regions. 
Therefore, in this sense, the invertibility of the nuisances is trivial in the absence of 
occlusions, so we focus on occlusions. Specifically, we want to relate observability 
(quantifying the portion of the state space that can be determined from the outputs) to 
reachability (quantifying the portion of state space where the state can be steered to by 
action of the inputs). This in turn will relate to the volume of the input delay line (8.59) 
in a trivial manner. 

To do this, we first perform model reduction. We parametrize S as usual (Section 2.2) rep- 
resenting the generic point p as the (radial) graph of a scalar, positive-valued function Z : D —> 
R + , via p(x) = xZ(x), where x is the homogeneous coordinate of x. From the assumptions 
underlying (8.58), we have that lo(xo) = p(p) — h(xt) where x t — ngtP — 7rg t x~oZ(xo). 
This allows us to eliminate p in the co-visible regions (since p(p) = p(xoZ(xo)) — Io(xo)), 
and reduce the representation of S to the scalar function Z: 

'Z = 
q — uq 

(8 60) 

Iq(x) — It(irgtxZ(x)) Vx G D D {irgtxZ(x) \ x £ D}. 

w t (x) 



CONTROL AUTHORITY 



OBSERVABILITY 



REACHABILITY 



13 The controllability distribution is computed using the Lie derivative of g along the vector field f , i.e. 
C = {g, Lf g, L|g, . . . }, which in the trivial case f = reduces to the delay line (8.59). 
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Now, given regularity assumptions on Z (e.g. sparsity [138], piece- wise smoothness [93], or 
its finitely-generated nature), and on Jo, h (for instance that they be non-trivial, V/ ^ 0) and 
the fact that the translational component of g t is non-zero, [174] shows that Z(-) is observable 
from the above model. So, in the absence of occlusions there would be no problems with visual 
decision tasks. 

However, I t (x) is defined for all x <E D, and not just x G D fl w^ 1 (D). Therefore, 
as we have seen earlier in this Chapter 8, {I t (x),x G (D)} is the information gain 

provided by the current image. We could then repeat this exercise for the current image I tl 
and the next image Jt 2 , and discover a new portion of the domain of the scene in each time 
instant. Unfortunately, this needs to be done with respect to a new parametrization. Indeed, at 
each time instant the visible portion of the scene can be parametrized as the graph of a function, 
but it is a different function at each time. We will call this function Z t {x), and refer to the 
function Z(x) above as Zo(x). We then have that, in the reference frame of the camera at time 
t, we have pt(x) = xZ t (x) = gtPo, and at t = we have po(x) = xZo(x). Therefore 
g t xZ (x) = xZ t (x) and hence 



and consequently 



Z t (x) = C^Zo(x) (8.61) 

X 1 X 



Io(x) = p(p), V p = xZ (x) , x G D (8.62) 



It(x) = pip), \/p= ^gtxZo(x) C S. (8.63) 

X 1 X 



Pt(x) 

The portion of the domain of the scene covered at time t is pt(D) C S. The coverage up 
to time t is denoted by p t (D), following a convention standard in system identification, where 
p l (D) = p\ (D) Up2 (D) U • • • Up t (D). This shows that, at time t, a portion of the domain 
ft C D is occluded if its pre-image is contained in the portion of the scene that is 
covered at the current time, but at no previous time: 

x en^p t (x) eptiD^p'-^D) (8.64) 

and, consequently, the information increment, the complexity of the innovation defined 
in (8.18), can be measured observing that 

{/*(*), xeQ} = p e ptMSpt-^D)}. (8.65) 

So the information increment is the ART of {I t {x), x G D}, and its uncertainty is the 
AIN defined in (8.21). This clarifies the relationship between Actionable Information 
and the mutual information 1(1; £) discussed in Section 2.5. Finally, we have that the 
observable set is given by 

sup p\D) *±¥ S/SE(S) x R (8.66) 



and, therefore 



|J I T (x) = P (p\ epHD) ^ p/W = ART(p) (8.67) 
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where W is the set of domain diffeomorphisms [176]. This shows that the entire scene 
is asymptotically observable modulo a similarity reference frame. The volume of the 
observable space is finite so long as the scene is restricted to a compact set, but other- 
wise can be infinite. 

On the other hand, the reachable set given a control {u T G W}* =0 is given by 

t 

max M g~ 1 p\D), s. t. g T = u T g T . (8.68) 

{u T ewH=o T 3o 

Thus, the reachable set is a subset of the observable set, which is asymptotically the en- 
tire state-space. Since these can have infinite volume, it is more convenient to visualize 
the complement of the observable set, which is the volume of the set of indistinguish- 
able states, as a function of the volume of the control delay line (8.59). This follows INDISTINGUISHABLE STATES 
a generally monotonically decreasing curve, with a vertical asymptote, where with no 
control authority the volume of the indistinguishable states is infinite (and recognition 
performance is at chance level, per Proposition 1), and a horizontal asymptote along the 
abscissa, where infinite control authority asymptotically yields complete observability, 
and therefore arbitrarily good recognition performance (Proposition 3). In between, 
such a Control-Recognition curve quantifies the achievable error for any given bound 
on the control authority. This hypothesis is verified empirically and in simulation for 
the Cartoon Flatland model in [191, 79, 78, 99]. 

It should be noted that the problem of measuring the volume of the controllable 
space, or more in general of quantifying the "control authority" of a system, is an active 
area of research, where many ideas on how to define a partial order on the space of 
controllers are being developed. We have explored the very simplest case of an entirely 
kinematic agent that can be instantaneously controlled, and we leave extensions to 
more complex classes of systems for further investigation. 
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Chapter 9 

Learning Priors and Categories 



Previous chapters have shown how to handle canonizable nuisances (Section 3), and 
non-invertible nuisances via either marginalization (2.12), extremization (2.13), or - 
if we are willing to sacrifice optimality for decision-time efficiency - by designing 
invariant descriptors (Chapters 4 and 6). In all cases, the design of a visual classification 
algorithm requires knowledge of priors on the nuisances as well as on the scene. 

The procedure we have outlined for building a template (2.30), or a Time HOG 
descriptor (Section 6.4) using a training sample {I t }J =1 ~ p(I\c), assumes that a 
sequence of frames g t is available (Section 5.1). Thus we have implicitly assumed that 
the scene £ is the same, not just the class c. Indeed, before Chapter 7 we even assumed 
that the underlying scene was static, which enabled us to attribute the variability in 
the data to nuisances, rather than to intrinsic factors. Even in Chapter 7 we assumed 
that local variability was due to nuisance factors, and in both cases this enabled us to 
determine the object-specific nuisance distribution. 

What we have deferred until now is the possibility for intrinsic (intra-class) vari- 
ability. It cannot be realistically assumed that an explicit model be available for all 
classes of interest, and therefore such variability should be learned, although it is likely 
that some basic components of models can be shared among classes. In this chapter we 
describe an approach to build category models starting from the local representations 
described in previous chapters. 

The starting point are occlusions, detected as described in Section 8.1. These can 
be used to bootstrap a partitioning of images into detachable objects as we will show in 
Section 9.1. Such a "segmentation" process is different than traditional (single-image) 
segmentation, and can be accomplished through relatively simple computations (linear 
programming). Such detachable objects can then be tracked over time (Section 9.2), 
providing the support where the local descriptors of Section 6.4 can be aggregated. 

However, knowledge that a certain descriptor belongs to a certain object - while 
an improvement on than the so-called "bag-of-feature" approach that considers the dis- 
tribution of descriptors on the entire image - still fails to capture important geometric 
relationships. In some cases, there may be an advantage in further subdividing objects 
into "parts", or subsets of descriptors based on spatial relations (Section 9.3). 

This process produces a model of a specific object, removing nuisance variability 
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from the data. In order to represent intrinsic variability and arrive at a categorical model 
it is necessary to aggregate different instances of the same class. We will assume that 
the class label is provided as part of the training process. This is because sometimes 
categories can be defined based on non- visual characteristics. For instance, an object 
can be called a chair if someone can sit on it, which is a functional property that may 
or may not have visual correlates. Thus the class distribution of chairs may include 
largely disparate objects with different shapes and materials. 

This approach is somewhat different from the conventional approach to object cate- 
gorization, that aims to detect or recognize the category before the specific object. Such 
a literature is motivated by studies of primate perception, where there is an evolutionary 
advantage in being able to assess coarse categories (e.g. animal or not) pre-attentively. 
The category model that is implicit in many of these approaches is not very different 
from an object model, and an effort is under way to make these models more specific 
and more discriminative to perform fine-scale categorization, or in other words getting 
closer to models of individual objects. In our case, objects come before categories. 
When we learn a model of a chair, we first learn a model of this particular chair. The 
fact that someone may sit on it makes it a chair just like another one regardless of its 
shape and appearance. This also allows one particular object to be easily attributed to 
multiple categories: A toy elephant can be a chair if someone can sit on it. Of course, 
there may be particular categories that are defined by visual similarity, in which case 
one can expect relatively simple categorical distribution that is well summarized by 
few samples. 

In summary, a (detachable) object is one that triggers occlusions, and that supports 
a collection of descriptors that are organized into parts. A collection of objects induces 
a distribution of parts and their spatial configuration. Such a distribution can be multi- 
modal and highly complex, and often requires external supervision to be learned. Of 
course, descriptors and parts can be shared among objects and also among categories, 
but this is beyond our scope here and is the subject of current research. 

9.1 Detachable objects 

Detachable objects were defined in [8, 9] as (closed and simply-connected) surfaces in 
space that are partly surrounded by the medium (air). A consequence of this property 
is that, as soon as there is either object or viewer motion, occlusion phenomena occur 
in the imaging process. Occlusion regions then, inferred as described in Section 8.1, 
provide local depth ordering information: Around each occlusion, one knows that the 
portion of the scene that projects onto the occluder is closer to the camera than the 
portion of the scene that projects into the occluded. Unfortunately, this information 
is available only around occlusions: [ ] showed how to extend this information into a 
global ordering. The contribution of [9] was to show that this can be done by solving 
a linear program, including estimating model complexity (number of objects). If the 
relative motion between object and viewer is small, occlusions may go undetected. 
However, if an extended temporal observation is available, detachable objects can be 
aggregated over multiple views (Fig. 8.3). 

Under the assumption of Lambertian reflection, constant illumination and co- visibility, 
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I t (x) is related to its (forward and backward) neighbors It+dt (x), h-dt (x) by the usual 
brightness-constancy equation 

It(x)=It±dt(x + v± t (x))+n±(x), xeD\Q±(t) (9.1) 

where v+ t an d v -t are the forward and backward motion fields and the additive residual 
lumps together all unmodeled phenomena and violations of the assumptions. In the co- 
visible regions, such a residual is typically small (say in the standard Euclidean norm) 
and spatially and temporally uncorrelated. Following Section 8.1, we assume to be 
given, at each time t, the forward (occlusion) and backward (discovered) time- varying 
regions ft+(£), ft_(t), possibly with errors, and drop the subscript ± for simplicity. 
The local complement of ft, i.e. a subset of D\ft in a neighborhood of ft, is indicated 
by ft c and can be obtained by using morphological dilation operators, or simply by 
duplicating the occluded region on the opposite side of the occlusion boundary (Fig. 
9.1). 




Figure 9.1: Left to right: ft_(t) (yellow); ft+(t) (yellow); ft (yellow) and ft c (red) on 
the 168 th frame of the Soccer sequence [22]. Segmentation based on short-baseline 
motion does not allow determining whether the right foot and leg are detachable; how- 
ever, extended temporal observation eventually enables associating the entire leg with 
the body, and therefore detecting the person as a whole detachable object. 



In order to detect the (multiple) detachable objects, we must aggregate local depth- 
ordering information into a global depth-ordering model. To this end, we define a label 
field c : D x R + — >> Z + ; x \-> c(x, t) that maps each pixel x at time t to an integer 
indicating the depth order, c(x,t). For each connected component k of an occluded 
region ft, we have that if x G ftk and y G then c(x, t) < c(y, t) (larger values of 
c correspond to objects that are closer to the viewer). If x and y belong to the same 
object, then c(x,t) = c{y,t). To enforce label consistency within each object, we 
therefore want to minimize |c(x, t) — c(y, t)\, but we want to integrate this constraint 
against a data-dependent measure that allows it to be violated across object boundaries. 
Such a measure, dfi(x : y), depends on both motion and texture statistics, for instance, 
for the simplest case of grayscale images, we have d[i(x,y) = W(x, y)dxdy where 

ae -(h(x)-I t (y)) 2 + p e -\\v t {x)-v t {y)\\l ^ _ ^|| < ^ 

otherwise; 



W(x,y) = 



where e identifies the neighborhood, a and (3 are the coefficients that weigh the inten- 
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sity and motion components of the measure. We then have 

c=arg min / \c(x, t) — c(y, t)\da(x, y) 

s. t. c(x,t) < c(y,t) V x e fife (t), 2/ G fifc(t), 

with fc = 1, . . . , if and ||x — 2 < e. This problem would be solved trivially by a 
constant, e.g., c(x) = 0, if it were not for the conditions imposed by occlusions. 

Starting from this formalization, [ ] shows how to convert (9.3) into a linear pro- 
gram (Section 2 of [ ]), how to automatically choose the (unknown) number of ob- 
jects (Section 3), how to use an extended temporal observation to more reliably detect 
detachable objects (Section 4), and how to compensate for errors in the occlusion de- 
tection process (Section 5). We do not summarize these results here as the reader can 
easily access them in [8]. 



9.2 Object tracking 

Detachable object detection provides a segmentation of the domain of a time- varying 
image. It does not, however, provide temporal association between points within each 
object, although the optical flow that is computed as part of occlusion detection does. 
Detachable object detection also provides a mechanism to group feature descriptors 
that belong to the same object. This is a departure from bag-of-feature models that 
lump together descriptors in the entire image. 

This segmentation also provides a mechanism for tracking objects, as opposed to 
individual features. Often object tracking is based on a user providing a "bounding 
box" for the object of interest. This can be considered as the "training set" and used to 
detect the object in subsequent frames. This, however, presents two problems. 

First, normally in a detection task one is given as a training set a fair sample from 
the class-conditional distribution, from which one can build a classifier and use it to 
detect a new instance in the test set. Here, it is the other way around: One is given one 
instance as a training set, and then is asked to classify samples in subsequent frames. 
For anything other than simple cases, shape and appearance change, and therefore the 
one training sample is typically not representative of the class conditional distribu- 
tion. This has prompted many to frame the problem of tracking in the framework of 
semi- supervised learning, where detection at time t uses a labeled sample at the initial 
time, and all unlabeled data prior to time t. Unfortunately, if the missing labels are not 
marginalized, and instead classification of past data is used ad "ground truth" labeling 
for future data, this strategy typically fails. 1 An independent mechanism for side infor- 

lr The sigma-algebra generated by successive classification, or "self-learning", is identical to the sigma 
algebra generated by the first sample. If we call c the discrete decision variable associated to the presence of 
an object at frame t, then it is trivial to show that 

H(c t+1 1£ , co, /*) = H(c t+1 1£ , c , /*, (9.4) 

so the information gain from the intermediate labels c* is zero: Adding positive and negative outputs from 
the classifier to the training set cannot, in general, improve the classifier. 
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Figure 9.2: Challenges of tracking-by-detection: The P/N Tracker [ ] (first row) 
drifts because the target changes appearance and never returns to the initial config- 
uration, and never recovers past frame 55. MIL Track [ ] (second row) locks on a 
static portion of the background and fails at frame 208. Both phenomena are typical of 
tracking-by-detection approaches based on semi- supervised learning without explicit 
side-information. Introducing an occlusion detection stage, and enforcing label consis- 
tency within visibility-consistent regions (detachable object detection) allows tracking 
the entire sequence from [ ] despite large scale changes, changing background, and 
significant target deformation (third row). Of course, this approach fails too (failure 
modes: bottom row), when the target is motion-blurred or subject to sudden illumina- 
tion changes (frames 349 and 403 respectively) but quickly recovers (frames 352 and 
417 respectively), missing 17 frames out of 1496 (98.86% tracking rate). The green 
box is the state £ t , black is the state covariance; blue is the prediction, red is the convex 
envelope of the detachable object. 

mation is necessary to avoid "self-learning." Occlusion regions and detachable object 
segmentation provide such a mechanism. 

Second, even the first training sample is typically not "pure," for the bounding box 
often includes pixels on the object of interest, as well as pixels on the background, 
and the system does not know which one is which. This is why some have framed 
the problem as a multiple-instance learning task, where one is given a "negative bag" 
(pixels outside the bounding box) that is known to not contain the object of interest, 
and a "positive bag" (the bounding box) that contains at least some pixels on the object 
of interest, but one does not know which ones. Again, detachable object segmentation 
provides a way to discriminate "foreground" pixels within the bounding box (the largest 
detachable object within) from "background." These problems are evident in Fig. 9.3. 

Selecting a (point estimate of) the class labels is only sensible if the posterior 
is sharply peaked around the (possibly multiple) modes, so it can be approx- 
imated with p(c\I) ~ J2 i S(c — c 1 ((f)(1))) . When this is not the case, one has to 
consider all possible labels assignments with their probabilities. In other words, the in- 
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termediate labels have to be marginalized, rather than selected. If we assume a Markov 
structure, where the statistic 2 £ t = (j)(P ) summarizes the history of the data up to t, or 
c t JL I 1 | this is done by Chapman-Kolmogorov's equation: Starting from an initial 
condition, p(co, £o |^o)> one can iteratively compute, given p(c t , & the prediction 

pict+uMi*) = / (9.5) 

from which, using Bayes' rule, one computes the update 

p(ct+i,ft+ii ) — "F~77 1 ; ^7 7 im- ( 9 ' 6 ) 

J p(it+i|ct+i,ft+i)^P(ct+i,ft+i|I*) 

The state transition p(ct+i,€t+i\ct,€t) may factorize into p(ct+i\ct,€t)p(€t+i\€t), 
and the measurement equation p(/t+i|ct+i, £t+i) simplifies given the deterministic 
relation between <^ + i and It+i. These dependency relations can be expressed by a 
(semi-)generative model for the realizations as follows: 

^1+1 = /(a) + vj, * = 1, . . . , N(t) 6=o = & 
ct+i = g(ct,€t,wt) c t=0 = c (9.7) 
= ft 

where v t G R fexAr ( t \ iy t g M m are suitable noise processes, and /, g are tempo- 
ral evolution models that could be trivial (e.g. /(£) = ^ for a Brownian motion, and 
^f(c, ^, w) = c.) Eq. (9.5)-(9.6) are the filtering equations that propagate all possible 
label assignment probabilities. Computing these amounts to solving the filtering prob- 
lem (say with a particle filter) in several thousand (variable) dimensions kN(t). This 
is a tall order. So, we have one option (estimating a point estimate of the labels, (9.4)) 
that is futile, and another (marginalization, (9.5)-(9.6)) that is unfeasible. 

In order to make the intermediate labels c t informative we must introject side in- 
formation. This can be either measurements that enlarge the sigma- algebra of the mea- 
surements (e.g. a range sensor), or priors that sharpen the posterior. Since we do not 
want to perform additional measurements, we focus on the latter. 

Instead of providing labels c t we may know that features are lumped, at each instant 
t, into two groups, so that features within each group have high probability of having 
the same (unknown) label. Detachable object detection provides such a grouping of 
features, that is then considered as a form of "weak labeling." 

Clearly, there is no guarantee that a tracker based on this side information will 
always work, as one can conceive scenes where both the target and the background 
change discontinuously and rapidly enough that any point-estimate misses the mode. 
Some failure modes are described in Fig. 9.2, bottom row. Also, this approach is 
no substitute for proper marginalization, when the latter is possible, and remains a 
heuristic. 

In order to accommodate the possibility that the training set contains positive "bags" 
where only some of the pixels are on target, we can extend the model above in the 



2 We omit the "hat" in the representation £ for simplicity. 
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framework of multiple-instance learning (MIL) as follows: 

'6+1 = /(&) + v u 6=o = Co 

w t+ i=k(w u c u I t ), w t =o=w /n Q , 

= #(6> n t) c t=0 = c 

>Wi4J = ^({4 H ) = 6 

where k; denotes the parameters of a classifier [ ] . Note that the initial condition for 
the classifier wq is given by assuming that all samples in the bounding box are positive, 
and all those outside are negative. The actual classification c uses the side information 
at all subsequent times t > 0. We refer the reader to [207] for an instantiation of the 
inference process based on the model above. 



9.3 Parts 

Hierarchy and compositionality are often cited as principles for representing complex 
processes. The basic idea is that a large variety of objects can be built by compos- 
ing a relatively small number of "parts", with composition rules that can be repeated 
recursively. Unfortunately, however, it is unclear what a "part" is. There have been 
many attempts to define and operationalize the notion of parts, from skeleton models 
to groups of features, etc. But, like for the case of features, it is difficult to see a reason 
why breaking down objects into parts may be beneficial for the purpose of classifica- 
tion. In this section, following [213], we conjecture a possible explanation. That is, we 
point out that in some cases, discriminating an object ("foreground") from everything 
else ("background") may be more easily done if the object is broken down into subsets, 
which we call "parts." In this case, parts are defined and inferred automatically during 
the classification process, as opposed to being defined in an ad-hoc fashion that is not 
tied to the task at hand. 

In [213], the notion of "Information Forest" has been motivated by the fact that 
some objects may be difficult to discriminate from the background if they are con- 
sidered as a whole. So, in a typical divide -et-imp era strategy, the problem is broken 
down into smaller problems that may be easier to solve. In fact, they will be easier to 
solve, since the criterion for breaking the original problem is precisely the simplicity 
of the ensuing classification. Only when classification is "easy enough," which can be 
measured, is it actually performed. 

This program is carried out in the context of sequential classifiers. In a Random 
Forest classifier [ ], the data is recursively partitioned in such a way that the parts are 
as "pure" as possible, in the sense of sharing as much as possible the same training 
label. Purity can be measured by the entropy of the label distribution. However, in 
some cases no pure clusters can be determined, so rather than attempting to split the 
data based on purity (minimum entropy), one can attempt to split them by mutual 
information, that is by attempting to make the cluster distribution not pure, but easy 
to separate (minimizing the mutual information between the resulting clusters). Only 
when the mutual information is sufficiently small is the classification problem easily 
solvable, and therefore one can split based on entropy. 
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Following [213], we illustrate the principle of Information Forests for a binary 
decision problem with a class label c G {0, 1} using simplified notation whereby 
x G D C with k = 2,3 is a location variable, and y : D — >> Y, x i— > a 
measurement (or "feature") associated to location x, that takes values in some vector 
space Y. When the domain D is discretized (e.g., the planar lattice), x can be identified 
with an index i G c \ xi G D. In that case, we indicate simply by yi. We are inter- 
ested in partitioning the spatial domain D into two regions, Vt and D\Q, according to 
the value of the feature y{x). This can be done by considering the posterior probability 

P(c\y) (xp(y\c)P(c), (9.9) 

where the first term on the right hand side indicates the likelihood, and the second term 
the location prior. It should be clear that meaningfully solving this problem hinges on 
the two likelihoods, p(y\c = 1) and p(y\c = 0) being different: 

p(y\c=l)^p(y\c = 0). (9.10) 

If this is the case, we can infer c and, from it, Q = {x | c(x) = 1}. However, there 
are plenty of examples where where (9.10) is violated. A simple example is shown in 
Figure 9.4. 

We refer to problems where the condition (9.10) is violated as problems that "are 
not solvable as a whole ", in the sense that we cannot segment the spatial domain simply 
by comparing statistics inside ft to statistics outside. Nevertheless, it may be possible 
to determine parts, or local regions Si C D, within which the likelihoods are different: 

3 {Sj}? =1 | P (y\x G^,c = l)/ p(y\x e Sj, c = 0), 

Sj C D, j = l,...,iV. (9.11) 

Note that the collection {Sj} is not unique, does not need to form a partition of D, 
as there is no requirement that Si D Sj ^ for i ^ j, so long as the union of these 
regions cover 3 D. The regions Sj do not even need to be simply connected. In some 
applications, one may want to impose these further conditions. 

In the discrete-domain case, we identify the index i with the location Xi, so the 
regions become subsets of the data. With an abuse of notation, we write 

Sj = {ii,i 2 , • • • ,i nj }- (9.12) 
Therefore, we write the two conditions (9.10)-(9.11) as 



p(yi\ci 1) =p(yi\ci 0), 



p(yi\i G Sj, Ci = 1) ^p(yi\i G Sj,Ci = 0). 



(9.13) 



Assuming these conditions are satisfied, we can write the posteriors by marginalizing 
over the sets Sj , 

p(c\y % ) cc J2p^ I i e ^c)P(i G Sj\c)P(c) (9.14) 



3 Indeed, even this condition can be relaxed to assuming that these regions cover the boundary of fl, 
Uj Sj 3 dfl, by making suitable assumptions on the prior p(c\x). 
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or by maximizing over all possible collections of sets {Sj}. In either case, the sets 
Sj are not known, so the segmentation problem is naturally broken down into two 
components: One is to determine the sets Sj, the other is to determine the class labels 
within each of them: 

Given a training set of labeled samples 

Find a collection of sets {Sj}^ such that Sj C D and D c UjSj, that are "as 
informative as possible" for the purpose of determining the class label c. 

If the sets are "sufficiently informative" of Q, perform the classification; that is, deter- 
mine the label c within these sets. 

The key condition translates to the restricted likelihoods p(yi\i G Sj,c = 1) and 
p(yi\i G Sj,c = 1) being "as different as possible" in the sense of relative entropy 
(information divergence, of Kullback-Liebler divergence). When they are sufficiently 
different, the set is sufficiently informative of Q, and classification can be easily per- 
formed by comparing likelihood or posterior ratios. 

In an Information Forest, the groups ("clusters", or "regions") Sj C D are chosen 
within a class S defined by a family of simple classifiers (decision stumps). For con- 
venience, we expand the index j into two indices, one relating to the "features" fj and 
one relating to a threshold Ok. We then define, for a continuous location parameter x 

Sj k = {xeD\f J (x,y)>O k } (9.15) 

where the feature / : D x 7 1; (x, y) \-> f(x, y) is any scalar- valued statistic and 
the threshold G R is chosen within a finite set. We call the set of features T = {fj} 
and the set of thresholds 6 = {Ok}. The complement of Sjk in D is indicated with 
Sjk = {% £ D \ fj(x, y) < Ok} = D\Sjk. In the simplest case, for a grayscale image, 
we could have y) = y(x) where y(x) is the intensity value at pixel x. More in 
general, / can be any (scalar) function of y in a neighborhood of x. For the discrete 
case, where i is identified with the location Xi, with an abuse of notation we write 

Sjk = {iec\f 3 (y t )>0 k } (9.16) 

and again S? k = {i G c \ fj{y%) < Ok}. Here the features are / : c x Y —> R; (i,y) \-> 
f(yi). Specifying the feature and threshold (fj,0k) is equivalent to specifying the set 
Sjk and its complement Sj k . 

We are interested in building informative sets using recursive binary partitions, so 
at each stage we only select one pair {Sjk, Sj k }. Among all features in T and thresh- 
olds in 6, Information Forests choose the one that makes the set Sjk "as informative as 
possible" for the purpose of classification. From (9.13) it can be seen that the quantity 
that measures the "information content" of a set Sjk (or a feature fj,6k) for the pur- 
pose of classification is the Information Divergence (Kullback-Liebler) between the 
distributions p{yi \ i G Sjk,Ci = 1) and p(jji\i G Sjk,Ci = 0). In short-hand, we write 
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p(Vi\ • • • ,(H = 1) 3spi(yi\ • -) and p(yi\ • • • ,q = 0) asp (yi\ • • • ) and 

KL{fM = ^KU Pl ( yi \i G S) || po( yi \i G S))+ 

+ ^KL( Pl (^ G S c ) || p (^K G S c )). (9.17) 

From the characterization of the sets Sjk, i G 5jfc is equivalent to /j (2/i) > Ok, so 
we write Sjk = S(fj,9k). Therefore, a decision stump ("KL-node") chooses among 
features and thresholds one (of the possibly many) that satisfy 



£ A = argmax |5( {^ fc)l KL ( P i(yi\fj > k )\\p o (yi\fj > 9 k )) + 



\S c (fj,0 k ) 



Here KL(p||g) = E p In | = Jin ^dP denotes the Kullback-Liebler divergence. 4 

The normalization factors |5|/|D| and IS' !/!!)! count the cardinality of the set S and 
its complement relative to the size of the domain D. 

If the divergence value is sufficiently large, KL(fj,9k) > r, the positive and neg- 
ative distributions are sufficiently different, and therefore the classification problem is 
easily solvable. To actually solve it, one could use the same decision stumps (fea- 
tures) T, but now chosen to minimize the entropy of the distribution of class labels, 
p(ci\i G Sjk) — p(ci\fj > 9k), and its complement: 

H(fj,o k ) = m > D ° k) W i\fi > o k ) + l -^l^m( Ci \f J < e k ) (9.19) 

where M(p) = E p [\np] = J \npdP is the entropy of the distribution p. If the quantity 
(9.18) is sufficiently large, KL(fj,9k) > r, (9.19) can be solved. If not, the process 
can be iterated, and the data further split according to the same criterion, the maxi- 
mization of KL(fj,9k). The value r can therefore be interpreted as measuring the 
least tolerable confidence in the classification. 

Information Forests are a superset of Random Forest, as the former reduces to the 
latter when r = 0. While it has been argued [ ] that RF produces balanced trees, this 
is true only when the class T is infinite. In practice, T is always finite, and typically 
RFs produce unbalanced trees; IFs usually produce more balanced and shallower trees 
when the set of classifiers is restricted. 

More importantly for us, they provide a mechanism to automatically partition an 
object into "parts" based on a discriminative criterion. From this perspective, therefore, 
they can be thought of as a mixed "generative/discriminative" classifier that enables us 
to determine the constitutive elements of complex objects. 



4 Several alternate divergence measures can be employed instead of Kullback-Leibler's, for instance sym- 
metrized versions of it, or more general Jeffrey divergence. 
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9.4 Object category distribution 

Given a (training) video, one can detect occlusions (Sect. 8.1), bootstrap them into 
a segmentation of the image (detachable objects, Sect. 9.1), partition the objects into 
parts (Sect. 9.3), and aggregate feature descriptors (Sect. 6) by parts. This provides a 
collection not of feature descriptors, but of parts and their relation. Their relation can 
be topological (adjacency), or geometric (distance, orientation, shape). This is for a 
specific object seen in a video. 

If we want to aggregate a category distribution, where the category label is provided 
as part of the training set and may or may not have a visual correlate, we can repeat the 
above process for each object, and then construct a distribution of object using not just 
its feature descriptors, but also the collection of parts and their relations. The latter can 
be described by a graphical model, and there is a rich literature on graph and grammar 
models of relations. We will therefore not delve into that subject here. 

We will instead focus on what the constitutive elements of such a graph are. Rather 
than some abstract, hand-crafted attributes, the constitutive elements we work with are 
collections of descriptors that are learned from a subset of a video sequence that has 
been determined to be a "part." The process of aggregating these descriptors follows 
the procedure outlined in previous chapters. 

Starting from a video sequence 

It = h{g u ^v t ) + n t (9.20) 
one can obtained a sequence of canonized descriptors 

{ivt} = wmti. (9.2D 

When the data is gathered continuously, knowledge that the scene £ underlying all 
images I t is the same is provided for free. One can then attempt to infer both the scene 
and the nuisances, including occlusions (visibility) [54, 90], from the data. If, however, 
the data gathering process never "excites" the actions of certain nuisances, these will 
not be represented in the dataset, hence will not be learned. For instance, if all scenes 
in the training set are planar, one cannot learn dP(v), the prior on the diffeomorphic 
residual of the deformation in equation (8.9). The goal of the control is to generate 
training sequences that "invert" the nuisances, so that C(£) = £(£), and images can be 
hallucinated under the action of invertible nuisances only. Therefore, one can go from 
a passively learned template 

h = j IdP{I\c) = K9k,tk,Vk) (9.22) 

g k ~ dP(g) 
v k ~ dP[y) 
tk ~ dQc(0 

to an actively learned template where all nuisances are factored out 



i c = J IdP(<j>(I)\c) = h (9k,i^t)og^(u(t)) = h(e,i,0) (9.23) 

g k ~dP(g) 
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and the (controlled) group nuisance g(u(t)) = ifj(I(t)) does not contribute to the 
importance sampling. This formula illustrates the equivalence of the "ideal image" 
h(e, £, 0) with the representation £ that we have discussed in Remark 2. When the 
class is a singleton, no averaging occurs, and the template £ captures all information 
the data contains on a specific object. 

As we have mentioned, the inference process provides an estimate of the represen- 
tation (the collection of all descriptors on each of the parts and their relations) for one 
specific object. Once we repeat the process for each training collection (video) in the 
training set, we obtain a number of objects £i , . . . £/v that can be grouped together into 
a mixture distribution. 

Note that once learning has been performed, classification is possible on a snap- 
shot datum (a single picture) [ ] . We would not be able to recognize faces in one 
photograph if we 5 had not seen faces from different vantage points, under different illu- 
mination etc. Similarly, if we were never allowed to move during learning (or, more in 
general, to adapt the controlled nuisance to the scene), we would not be able to develop 
templates that are discriminative and at the same time insensitive to the nuisances. 

However, once we have established local correspondence (through g t ) to build 
a representation of an individual object £, then we can perform categorization by 
marginalizing, or max-outing, all the nuisances corresponding to different instantia- 
tions of the same class, i.e., samples from dQ c (^) = dP(£|c). For phenomenologically 
grouped categories (i.e. for categories of objects that share some kind of geometric (5), 
photometric (p) or dynamic characteristic), grouping can be performed by marginal- 
ization as follows. Given collections of images or video from multiple objects £ c all 
belonging to the same category, we can solve 

f,0fc,*>fc = arg min \\I k - h(g k , f, i/ k )\\* (9.24) 

where we have assumed that n(-) ~ Af(\\ • ||), so the maximum-likelihood solution 
corresponds to the minimum norm solution, and where the norm || • ||* can be the 
standard Euclidean norm in the embedding space of all images 1 1 1 — J | | * = 1 1 1 — J 1 1 , or 
- if some nuisances have been canonized - it can be a (cordal or geodesic) distance on 
the quotient I/G, where G C G is the group that has been canonized, or || I — J||* = 
— i( J) || for the case of a cordal distance. The problem (9.24), for the Ambient- 
Lambert case, has been discussed in [210], [53] and [86] in the presence of one or 
multiple occluding layers, respectively, and in particular in [174] it has been shown 
to be equivalent to image-to-image matching as described in Section 5.2. Once the 
problem has been solved, sample-based approximations for the nuisance distributions 
can be obtained, for instance 

M M 

dp {v) = Kv ^ v ^MM^); dP (g) = K g (g - gi)dfi(g)] (9.25) 

2=1 i=l 

where n are suitable kernels (Parzen windows) and M = N T NsN k . 



5 As we have already remarked, "we" here indicates the species, not necessarily the individual, as learning 
can occur in a phylogenetic fashion. 
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If the problem cannot be solved uniquely, for instance because there are entire sub- 
sets of the solution space where the cost is constant, this does not matter as any solution 
along this manifold will be valid, accompanied by a suitable prior that is uninformative 
along it. When this happens, it is important to be able to "align" all solutions so that 
they are equivalent with respect to the traversal of this unobservable manifold of the 
solution space. This can be done by joint alignment, for instance as described in [199]. 

When the class is represented not by a single template £, but by a distribution of 
templates, the problem above can be generalized in a straightforward manner, yielding 
a solution from which a class-conditional density can be constructed. 

M 

dQc(0 = J2 K ^~ii)M0- (9.26) 

i=l 

In this case it is important to ensure that the manifolds of the high-dimensional space of 
images X spanned by the scene, and that spanned by the nuisances, are as "transversal" 
as possible (Section B.3). 

An alternative to approximating the density Q c (£) consists of keeping the entire 
set of samples {£}, or grouping the set of samples into a few statistics, such as the 
modes of the distribution dQ c , for instance computed using Vector Quantization, as is 
standard practice. 

Note that these densities are defined in the space of representations, that is finite- 
dimensional, albeit it may be high-dimensional, and not in the space of scenes. While 
the process described above is conceptually straightforward, there are significant com- 
putational challenges in aggregating distributions from samples in high dimensions. 
This is also an active area of research beyond the scope of this manuscript. 



136 



CHAPTER 9. LEARNING PRIORS AND CATEGORIES 




Figure 9.3: Van sequence from [ ]; the top two rows show the result of the P/N 
tracker [98] on frames 2, 40, 50, 60, 85, and 140, beyond which the tracker fails to 
recover. Similarly, [207] fails towards the end of the sequence (middle two rows): The 
superpixels labeled as positives are shown in color superimposed the darkened image. 
One can see that towards the end, the true positives have leaked to include the road, and 
the tracker has drifted. The last three rows show the state estimated with [207], and the 
last row shows the corresponding tracks x\. Red tracks are the true positives, blue are 
the false positives, yellow are true negatives. 
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Figure 9.4: (Left) y. (Right) c. The conditional distribution of pixel intensities yi inside 
the square (ft) is identical to the one outside. However, restricting the classification to 
subsets ofQ makes the class-conditionals easy to separate. 
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Chapter 10 

Discussion 



Physical processes contributing to image formation occur at a level of granularity that 
is infinitesimal relative to the sampling capacity of optical sensors. And yet, we seem 
to take for granted that the epistemological process requires breaking down the data 
into elementary "atoms." Conventional information theory suggests that such a sym- 
bolization process would occur at a loss, and that integrated systems capable of sensing 
and action would be best designed in an end-to-end fashion. 

However, we have seen that nuisance factors such as illumination and viewpoint 
changes account for almost all the variability in the data, and therefore the process of 
eliminating their influence in the data can lead to a lossless symbolization. In other 
words, under certain circumstances, it is possible to throw away almost all the data, 
and yet none of the information. 

Furthermore, nuisances that are not invertible, such as occlusions, can be factored 
out through a controlled sensing action. Indeed, control is the "currency" that trades 
off performance in a perception process. The more control an agent can exercise on the 
sensing process, the tighter the bound that can be guaranteed on the reduction of the 
Actionable Information Gap. 

The models we describe in this manuscript are idealized abstractions. Nevertheless, 
an abstraction is useful to guide investigation and to evaluate existing schemes. In fact, 
one of the many possible objections to our program is that the models we used in our 
analysis (for illumination, deformation, occlusion etc.) are so simplistic as to make 
the formalization exercise futile. Better is to devise new algorithms and test them 
on empirical benchmarks. That is generally true, although an abstraction allows us 
to ascertain how different algorithms are related, and determine not just whether one 
works better than another, but why. Shannon's sampling theorem is valid only for 
strictly band-limited signals, a wildly idealized abstraction of real signals. And yet, it 
provides useful guidelines for the design of certain algorithms, for instance for audio 
processing. 

Remark 16 (Limitations of the symbolic notation) The notation h(g,£,v) is meant 
to separate the role of invertible (group) nuisances g and non-invertible nuisances v. 
However, non-invertible nuisances such as occlusions can arise from the interplay of 
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group transformations g and acting on the scene £. For instance, a non-convex scene 
can generate self-occlusions via the action of a planar translation, which is a group. 
This is why we call the entity h(g : £, 0) a "hypothetical" image, because in practice it is 
not possible to act on the scene £ through a group g without generating non-invertible 
nuisances. Instead, in general we have v = v(£,g), so non-invertible nuisances can 
appear through the action of a group acting on the scene. Thus also the notation logov 
and the composition of group actions and nuisances is inconsistent unless we take into 
account this dependency through a more elaborate notation, referring to an explicit 
model such as one of those described in Appendix B.l. As a consequence, the notion 
of commutativity has also to be specified with respect to a specific image -formation 
model, rather than the generic symbolic model. 

This can be confusing at first reading, but the fact that non-invertible nuisances 
cannot be dealt with in pre-processing, and that reducing the resulting information 
gap requires control of the sensing process, remains, and is already illustrated by the 
simplified formal notation. 

Another valid objection is that we have reduced visual perception to pattern clas- 
sification. We have not tackled high-level vision, perceptual organization, and all the 
high-level models that lie at the foundations of our understanding of higher visual pro- 
cessing. This does not imply that these problems are not important. We have just 
chosen to focus on the lowest level, which is the conversion from analog signals to 
discrete entities, on which high-level models can be built. 

So far we have been deliberately vague about the difference between information 
and knowledge. We believe knowledge acts on information, but also needs tools to 
manipulate representations, including counterfactuals and causal analysis [ ]. Nev- 
ertheless, most investigations that we are aware of take a discrete, atomic representation 
as a starting point, and fail to bridge the "gap" required to explain why we need such a 
representation in the first place. We hope to have addressed this in a number of ways. 
First, by showing that even for invertible nuisances, invariants can be a set of measure 
zero [ ]. Second, non-invertible nuisances call for breaking down the image into 
pieces (segments); [ ] show that to recognize an object with a viewpoint invariant 
feature you have to discard its shape. Interestingly, this was already understood by 
Gibson, who expressed it in words: ([ ], page 271): 

"Despite the argument that because a still picture presents no transforma- 
tion it can display no invariants under transformation, [in Gibson, 1973] I 
ventured to suggest that it did display invariants". 

We have shown that what what is left after discounting viewpoint variability is a set 
of measure zero relative to the image: Almost all the data is gone; what is left is 
information. From there, knowledge can be built by induction or deduction, by logic 
or probabilistic inference, or a combination thereof as in Bayesian inference. But the 
crucial step is how precisely data are discarded to arrive at information. 

We have also shown that, in the absence of specific modeling of the nuisances, 
one would have to marginalize them as part of the matching process, which entails 
infinite-dimensional optimization. This does not mean that one cannot obtain positive, 
even impressive, results on a specific domain, for instance in semi- structured or fully 
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structured environments where the variability due to nuisances is kept at bay. However, 
one should not extrapolate the optimism stemming from success on a specific domain 
with the solution of the general problem. 

10.1 James Gibson 

The notion of Actionable Information presented in this manuscript is closely related 
to the notion of Information proposed by Gibson, although he never formalized these 
notions; in words, however, page 245 of [66] recites 

"The hypothesis that invariance under optical transformation constitutes 
information for the perception of a rigid persisting object goes back to the 
moving-shadow experiment (Gibson and Gibson, 1957)" 

And again on page 310: 

"Four kinds of invariants have been postulated: those that underlie change 
of illumination, those that underlie change of the point of observation, 
those that underlie overlapping samples, and those that underlie a local 
disturbance of structure. [...] Invariants of optical structure under chang- 
ing illumination [...] are not yet known, but they almost certainly involve 
ratios of intensity and color among parts of the array. [...] Invariants [...] 
under change of the point of observation [...] some of the changes [...] are 
transformations of its nested forms, but the major changes are gain and 
loss of form, that is, increments and decrement of structures, as surfaces 
undergo occlusion. [...] The theory of the extracting of invariants by a vi- 
sual system takes the place of theories of "constancy" in perception, that is, 
explanations of how an observer might perceive the true color, size, shape, 
motion and direction-from-here of objects despite the wildly fluctuating 
sensory impressions on which the perceptions are based." 

The line of the program sketched by Gibson in his theory of "information pickup" 
where "the occluded becomes unoccluded" is very closely related to the notion of in- 
vertibility of the nuisance and controlled recognition that we discuss in Chapters 8 and 
8.4. 

10.2 Alan Turing 

Alan Turing is perhaps the researcher that showed the deepest insight into the questions 
raised in this manuscript. On one hand, he specifically addressed the rise of "discontin- 
uous behavior" in continuous chemical systems through reaction-diffusion processes 
[190]. On the other hand, he addressed the issue of intelligent behavior in machines. 
While Turing's theory of morphogenesis, despite its limits, can be taken as sufficient 
evidence that biological systems may evolve towards discrete structures, it does not 
provide evidence of the need to organize measured data into discrete "information en- 
tities." 
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Since in building machine vision systems we are not constrained by the chemistry 
of reaction-diffusion, Turing's theory remains incomplete. In particular, it does not 
address why a data-processing system (biological or otherwise) built from optimality 
principles should exhibit discrete internal representations rather than a collection of 
continuous input-output maps. This "gap" thus remains open in Turing's work. Indeed, 
it is summarily dismissed: 

"the confusion between [analog and digital machines] is to be ignored. 
Strictly speaking, there are no [digital] machines. Everything really moves 
continuously. But there are many kinds of machine which can profitably 
be thought of as being discrete- state machines. For instance in consider- 
ing the switches for a lighting system it is a convenient fiction that each 
switch must be definitely on or definitely off. There must be intermediate 
positions, but for most purposes we can forget about them." 

He therefore moved on to characterize "intelligent behavior" in terms of symbols, as 
convenient to sustain the discourse, without regard to how such symbols might come to 
be in the first place. Digitization is indeed an abstraction. It is, however, an abstraction 
that "destroys" information, in the classical sense (Section 2.4.1). And yet, it seems to 
be a necessary step to knowledge or "intelligence". 1 

10.3 Norbert Wiener 

As we have remarked earlier, despite anticipating the role of information in making a 
decision, Wiener remains anchored to the notion of information as entropy. To be fair, 
this was revolutionary at the time, and indeed Wiener is credited as one of the pro- 
posers of using notions from statistical mechanics, including entropy, in the processing 
of signals. Part of the problem is that the task Wiener was implicitly considering is the 
reconstruction of a signal under noise. This is not surprising since so much of Wiener's 
work was devoted to Brownian motions and to the characterization of stochastic pro- 
cesses driven by "noise". 

It is interesting, however, to note that Wiener had the intuition that transformations 
play an important role, including the notion of the invariant under a group. On page 
135 of [204], he remarks about the ability of the human visual system to recognize 
line drawings, pointing the attention to discontinuities. He also introduces the notion 
of "group scanning" (what we have called max-out, or registration), hypothesizing 
computational hardware in the brain that could be implementing such an operation 
(page 137). Indeed, he suggests that McCulloch's apparatus could perform such a 
group- scanning. He even introduces the first moment as an invariant statistic to a group 
in equation (6.01) on page 138, and called it a gestaltl It is unfortunate that Wiener did 
not further elaborate on these points. 

Wiener also implies some of the seeds of ecological vision, mixed by the notion of 
invariant statistics, suggesting that 

x Note that Godel's assertion of the limitations of logic (1931) is irrelevant in this context, as our argument 
is not in favor of logic, but about the necessity of a discrete/finite internal representation. This could be used 
as an approximation tool, and continuous techniques be used for inference rather than logic. 
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"we tend to bring any object that attracts our attention into a standard 
position and orientation, so that the visual image which we form of it varies 
within as small a range as possible." 

This would be equivalent to "physical canonization" which, as Gibson suggested, can 
always be performed even when the nuisance is not invertible from a single image. 
Wiener does not explain, however, how his notion of information is compatible with 
his hypothesis that 

"processes occur [...] in a considerable number of stages each step in 
this process diminishes the number of neuron channels involved in the 
transmission of visual information" 

a sentence that reveals both the attachment to transmission of information as the un- 
derlying task, and the notion of "compression" as information that is, however, not 
developed further. 

10.4 David Marr 

This manuscript could be interpreted as an attempt to frame the ideas of Marr into 
an analytical framework, although Marr did not frame the questions he asked in the 
context of a task. 

"Our view is that vision goes symbolic almost immediately, right at the 
level of zero-crossings, and the beauty of this is that the transition [...] is 
probably accomplished without loss of information" 

This statement, from [127], is a incorrect if one means "information" in the sense of 
Shannon [164]. However, it is precisely correct if one is to take the approach described 
in this manuscript. The use of zero-crossings was ultimately rejected because the de- 
coding process was unstable; however, as we have argued, an internal representation 
is not needed for reconstruction; instead, for the task of recognition, an internal repre- 
sentation is necessary, and zero-crossings (a special case of feature detector F in our 
parlance) might not have been a bad idea after all. 2 

10.5 Other related investigations 

Donald L. Snyder used to often purport his mother's advice to "never throw away 
information." Our discussion reveals that any sensible notion of information that relates 
to the recognition task (as opposed to transmission) requires data (not information) to 
be thrown away. In other words, to gather information you must first throw away data, 
the more the better. So, what is "lossy" for image compression or data transmission is 

2 Marr's theories pertains not to vision in general, but to visual recognition. There is no need for an 
internal representation (such as that afforded by the primal sketch, the 2-1/2D sketch and the full sketch) for 
navigation, 3-D reconstruction, rendering or control. It is interesting that Marr uses stereo, and in particular 
Julesz' random dot stereograms, to validate his ideas, where a fully continuous algorithm that acts directly 
on the data would do as well or better. 
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not lossy for recognition, and in general for understanding images. This seems at first 
to defy any notion of traditional information theory and decision theory, for it involves 
multiple intermediate decisions. We hope that our arguments have convinced the reader 
that this is not the case for the theory developed in this manuscript. 

Some of the statements made in this manuscript may seem controversial. Certainly 
I do not wish to imply that individual organisms that do not exhibit visual recognition 
(e.g. blind people) do not have intelligence. Nor do I imply that every organism that has 
a visual system exhibits intelligent behavior. What I have argued is that organisms and 
species that need efficient visual recognition (limitations on the classifier to maximize 
computational efficiency at decision time) benefit from signal analysis (and therefore 
symbolic manipulation, internal representation etc.) 

It is interesting to speculate whether there are organisms that have vision, and that 
use vision for regression or control tasks (e.g. visual navigation) but not to perform 
decisions (e.g. visual recognition). For instance, it is interesting to speculate whether 
the fly navigates optically, but decides olfactorily. 3 To this end, there is evidence that 
the fly lacks the wide-ranging lateral connections exhibited in higher mammals. Marr's 
account on the fly's visual system (and the so-called "representation" that it implies, 
page 34 of [126]) really describes a collection of analog input-output maps, or collec- 
tion of sensing-action maps, with a simple decision switch, with no need for an internal 
representation. Wiener points out that 

"complicated as the behavior patterns of birds are - in flying, in courtship, 
in the care of the young, and in nest building - they are carried out cor- 
rectly on the very first time without the need of any large amount of in- 
struction from the mother." 

He uses this argument in support of phylogenic (species, as opposed to ontogenic, or 
individual) learning, and speaks in favor of endowed input-output maps without an 
underlying internal representation that is easily manipulated. A contrarian view on 
the topic is supported by experiments in the development of an individual organism's 
vision system in the absence of mobility [77]. 

Because mobility plays such an important role in this manuscript, it naturally relates 
to visual navigation and robotic localization and planning [182, 16, 82]. In particular, 
[202, 26, 175] propose "information-based" strategies, although by "information" they 
mean localization and mapping uncertainty based on range data. Range data are not 
subject to illumination and viewpoint nuisances, which are suppressed by the active 
sensing, i.e. by flooding the space with a known probing signal (e.g. laser light or radio 
waves) and measuring the return. There is a significant literature on vision-based navi- 
gation [29, 216, 143, 147, 179, 169, 60, 48, 163, 95] that is relevant to occlusion-driven 
navigation [110, 111, 12]. In most of the literature, stereo or motion are exploited to 
provide a three-dimensional map of the environment, which is then handed off to a 
path planner, separating the photometric from the geometric and topological aspect of 
the problem. This separation is unnecessary, as the regions that are most informative 

3 In a personal communication, Steve Zucker remarked that the fly does not exhibit long-range lateral 
interactions, and yet it is known to use vision for navigation, and have strong olfactory system to guide 
actions. 
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are occlusions, where stereo provides no disparity. Another stream of related work is 
that on Saliency and Visual Attention [ ], although there the focus is on navigating 
the image, whereas we are interested in navigating the scene, based on image data. In 
a nutshell, robotic navigation literature is "all scene and no image," the visual attention 
literature is "all image, and no scene." The gap can be bridged by going "from image 
to scene, and vice-versa" in the process of visual exploration (Chapter 8). The relation- 
ship between visual incentives and spatial exploration has been a subject of interest in 
psychology for a while [34]. 

This work also relates on visual recognition, by integrating structures of various 
dimensions into a unified representation that can, in principle, be exploited for recog- 
nition. In this sense, it presents an alternative to [74, 184], that could also be used to 
compute Actionable Information. However, the rendition of the "primal sketch" [126] 
in [74] does not guarantee that the construction is "lossless" with respect to any partic- 
ular task, because there is no underlying task guiding the construction. This work also 
relates to the vast literature on segmentation, particularly texture- structure transitions 
[208]. Alternative approaches to this task could be specified in terms of sparse coding 
[144] and non-local filtering [ ]. This paper also relates to the literature of ocular 
motion, and in particular saccadic motion. The human eye has non-uniform resolution, 
which affects motion strategies in ways that are not tailored to engineering systems 
with uniform resolution. One could design systems with non-uniform resolution, but 
mimicking the human visual system is not our goal. 

Our work also relates to other attempts to formalize "information" including the 
concept of Information Bottleneck, [183], and our approach can be understood as a 
special case tailored to the statistics and invariance classes of interest, that are task- 
specific, sensor-specific, and control authority- specific. These ideas can be seen as 
seeds of a theory of "Controlled Sensing" that generalizes Active Vision to different 
modalities whereby the purpose of the control is to counteract the effect of nuisances. 
This is different than Active Sensing, that usually entails broadcasting a known or 
structured probing signal into the environment. Our work also relates to attempts to 
define a notion of information in statistics [120, 20], economics [128, 4] and in other 
areas of image analysis [100] and signal processing [68]. Our particular approach to 
defining the underlying representational structure relates to the work of Guillemin and 
Golubitsky [ 7]. 

Last, but not least, our work relates to Active Vision [1, 21, 11], and to the "value 
of information" [128, 59, 70, 41, 35]. The specific illustration of the experiment to the 
sub-literature on next-best- view selection [150, 12]. Although this area was popular 
in the eighties and nineties, it has so far not yielded usable notions of information 
that can be transposed to other visual inference problems, such as recognition and 3D 
reconstruction. A notable exception is the application of active vision to the automotive 
environment, pioneered by Dickmanns and co-workers [51, 52, 50]. 

Similarly to previously cited work [202, 26], [49] propose using the decrease of un- 
certainty as a criterion to select camera parameters, and [ ] uses information-theoretic 
notions to evaluate the "informative content" of laser range measurements depending 
on their viewpoint. Other influential literature on the relation between sensing and 
action include [145, 69]. 

This manuscript of course also relates to data compression, in particular video com- 
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pression, but not in the traditional sense, where the goal is to reconstruct a copy of the 
signal on the other side of the channel, but where the goal is to transmit a compressed 
representation that is to be used for decision or control tasks. 

10.6 The apology of image analysis 

The favorite modern paradigm for "understanding" physical phenomena 4 hinges on 
reducing them to their elementary components, a form of data compression. For in- 
stance, to "understand" how a radio works, one would take it apart, i.e., analyze it, into 
ever smaller pieces until some recognizable elements are found. These "atoms" will 
then need to be re-assembled to establish their context, or relations. Taking a computer 
apart is far less enlightening. One can see the "metabolic" components of the device - 
the power supply, the fans - but once at the level of the microprocessor, even complete 
understanding of the basic element, the transistor, sheds little light on what the device 
may do, and how. A future archeologist, having managed to unearth a computer, would 
be hard-pressed to infer the past existence of spreadsheets, email, or computer games. 

The same goes for the brain. Understanding the anatomy and physiology of a 
single neuron, and even the connectivity of various neural structures, helps us little 
in understanding how the brain represents, stores, and processes information, whatever 
"information" may be. One thing we do know from the neurophysiology of early visual 
processing starting at the retina, and from acoustic processing starting at the cochlea, 
is that the brain processes sensory data by first "breaking it down into pieces." For 
instance, neurons in visual cortex VI only respond to stimuli in certain regions of the 
retina (receptive fields). So, the brain does seem to break signals apart, to analyze 
them. But "is data analysis necessary for knowledge?" To answer the question, one 
would have to first define "analysis" and "knowledge." For "analysis," I have already 
indicated this in Footnote 3 in the Preamble. For "knowledge," unfortunately, I do 
not have a proper definition and I would not attempt making one up. In general one 
imagines knowledge arising from manipulating "information," which would shift the 
burden to defining "information." Unfortunately, most definitions of "information" 
currently in widespread use, for instance the classical notion of entropy proposed by 
Wiener and elaborated by Shannon, fail to adequately describe the complexity of visual 
perception. 

What can be easily defined and measured is the performance in a "task." A task 
is any measurable action performed by an agent (human or machine). This may be 
reaching for a fruit, initiating a movement (physical actions) but also recognizing a 
predator. Rather than choosing the most general task, say "survival," we focus on 
the most specific and simplest possible task. 5 In general, a task entails a decision, 
and performing tasks is not by itself a trademark of what one would call "intelligent 
behavior." A plant performs plenty of tasks (grow, sprout, branch out, turn towards 
the sun etc.), so does a worm, a cockroach, a fly, etc. Some of these organisms exhibit 

4 Reichenbach, for instance, spoke of "isolating factors" as the premise for knowledge [158]. 

5 Marr [ ] used the generality of the task to advocate his approach (page 32): "Vision [...] is used in 
such a bewildering variety of ways [...] can the type of formulation that I have been advocating [...] possibly 
prove adequate for them all? I think so." 



1 0. 6. THE APOLOGY OF IMAGE ANALYSIS 



147 



what we would call "automatic behaviors" or "reactive behavior", in the sense that they 
respond to a sensed signal. What is common to all tasks is that they require sensory 
data, and they all involve a decision, even if as simple as the decision to initiate an 
action. 

So, we start from the assumption that information, however one wishes to define 
it, must be tied to some form of sensing or measurement process, that generate what 
we call "data". The data is aggregated in the learning process. In this manuscript, we 
do not distinguish phylogenic learning (species) from ontogenic learning (individual): 
From a mathematical standpoint they are equivalent. 6 

We characterize "information" to be specific for an agent undertaking a specific 
task, as whatever portion or function of the data is useful to accomplish the task. 1 The 
fact that information is task-dependent is missing in traditional information theory, 
which is fine when the underlying task is reconstruction of the "source" signal. 

As we have argued in the Preamble, physical phenomena are essentially contin- 
uous* or at least they exist at a level of granularity that is infinitesimal compared to 
that of the object of inference. So an agent performing a task can be thought of as a 
black box that takes as input some continuous function, and spits out an action, which 
could live in the continuum (e.g. a motion within the environment) but usually entails 
a decision (e.g. initiating the motion), or just a decision itself (is the spotted cat in front 
of me dinner food or vice- versa?). 

Clearly not all data is useful for the decision: Whether an animal at close range is 
going to eat us or not does not depend on the spectral characteristics of the light, but 
the latter affects our measurements of the animal nevertheless. So, in defining informa- 
tion we have to exercise care to make sure that it does not depend on such "nuisance 
factors " that hinder our performance in accomplishing the task (decide quickly, run if 
needed). 

Once we narrow down the task to a decision, we might embrace the safe confines of 
statistical decision theory. Unfortunately, at the outset the Data Processing Inequality 
tell us that the best thing is to devise a scheme that takes us directly from the sensed 
data to the final task, without intermediate decisions. This means that the most success- 
ful organisms ought to be "automata" that have a series of pre-programmed "sensing- 
action" behaviors triggered by sensed data. No breaking down the sensed data into 
pieces, since that would correspond to an increase in expected risk. 

So why should we break down the radio? Why should the brain break down 9 sen- 

6 They are, of course, different in the way they are implemented, as the former can be encoded at the 
neuronal level, while the latter must be encoded in the genome. 

7 It is interesting that Marr [ ] had coupled the notion of information with utility for the task, but 
somehow did not manage to extricate them. On page 31: "Vision is a process that produces from images of 
the external world a description that is useful to the viewer and not cluttered with irrelevant information." 
Note that we would use "irrelevant data" as opposed to "irrelevant information", as the latter is rendered an 
oxymoron by our definition. 

8 Continuity is of course an abstraction, as Pythagoras discovered trying to measure the hypothenuse 
of a triangle with integer sub-divisions of one of its sides. This issue has been settled by the first scientific 
revolution and we do not feel the need to further elaborate it here, other than to re-iterate that scale invariance 
of natural image statistics are what makes the continuous limit relevant in vision. It certainly is not necessary 
in Communications, where one wants to create a copy of the signal emitted by the source, precisely at the 
same scale. 

9 One could argue that a digital signal is already broken down, or "quantized" into pieces, and therefore 
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sory data? Why should we build an "internal representation" of the world by assem- 
bling the pieces, or "atoms," of the sensed data? Why would we ever evolve to acquire 
"symbols" (which entail intermediate decisions) instead of simply learning automatic 
behaviors? 

Cognitive Science and Epistemology deal with perception as combinations of "to- 
kens" and knowledge as operating on "symbols." But there is no notion as to why such 
"symbols" arise. 10 Is intelligence not possible in an analog setting? Can a fly exhibit 
intelligent behavior? Or is it just an automaton? 

Breaking down the data into "atoms" or what we refer to as "segmentation," thought 
of as a process of creating order, conflicts with the basic tenets of classical physics. This 
also exposes the inadequacy of entropy as a measure of information: A segmentation 
of the image, if done properly, not only does not reduce the amount of information 
it contains, but it is the beginning of the very process of performing an (informed) 
decision from it. The entropy of the segmented image, however, is smaller than the 
entropy of the data. Wiener connected some dots to justify the process in the context of 
meta-stable systems, noting that if we were to just follow the diktat of thermodynamics 
we would only be concerned with stable systems, and the stable state of every living 
organism is to be dead. But the bottom line is that explaining the crucial passage from 
data to symbols has been largely overlooked, and doing so requires at least as much 
attention as making sense of thermodynamics in non-equilibrium systems. 11 

So, the Data Processing Inequality (as well as mechanistic empiricism, and modern 
statistical machine learning) seem to indicate that the best one can do is to process all 
the data in one go, in order to arrive at the best decision. It does not matter that some of 
the data may be useless for the task. However, although there may be no advantage in 
data analysis, one could compute statistics (functions of the data) that keep the expected 
risk constant, and have other advantages, for instance speeding up the execution of a 
task. If we factor in some form of "complexity" or "efficiency", does a suitable notion 
of "information" suddenly arise? If so from what principle? In general, complexity 
calls for compressing the data, not for breaking it down into pieces. How does the 
structure of the environment play a role? 

The "complement" to information within the data is what we call "nuisances." 
These are phenomena that affect the data but not the decision. For instance, when 
we spot a predator standing in front of us, the reflectance properties of surrounding 
objects and the ambient illumination affect our sensory perception, but are useless as 

the question we address in this manuscript is moot. However, quantization is not analysis: It does not depend 
on the signal being sensed. One could argue that, for digital signals, since the data is already segmented into 
atoms, the epistemological process is one of integration, not analysis. In this case, however, one would have 
to decide (again, an intermediate decision) whether the integration process grows to encompass the entire 
domain of the data, or whether it stops in some local region, in which case the integration process is just 
another way of implementing the analysis of the signal. 

10 Hume, however, talked about "impressions" (sensory perceptions) and their relation to "ideas" (internal 
representation), but unfortunately with the analytical tools available in 1750 he could not get beyond a vague 
notion of "vivacity" as the differentiator between the two (the "idea" of the taste of an orange is not as tasty 
as the taste of an orange itself). 

11 It is interesting to note that Wiener did include a "decision" task in his definition of information ([204], 
page 61), and he did talk about groups and invariants in the context of information processing ([204], page 
50). However, he could not connect the two because the only "nuisance" that he considered was additive 
noise, and therefore his characterization of information remained bound to entropy. 



1 0. 6. THE APOLOGY OF IMAGE ANALYSIS 



149 



far as helping our decision of whether to flee or stay. Among nuisances, some are in- 
vertible, in the sense that their effect in the data can be eliminated without knowledge INVERTIBLE NUISANCE 
of the task, and some are not. The ones that are invertible can be eliminated with- 
out decreasing the expected risk of the decision at hand, thereby reducing complexity. 
However, eliminating invertible nuisances does not necessarily require analysis, i.e. it 
does not require breaking down the data into pieces. It should be re-emphasized that 
the data collection process can be changed to make non-invertible nuisance invertible. 
For instance, spectral sampling (color) can make cast shadows, that are a visibility ar- 
tifact and therefore subject to the same statistics as occlusions, invertible. Similarly, 
motion can make occluding boundaries, that are not observable in a single image, ob- 
servable. One could argue that in Gibson's ecological optics framework, all nuisances 
are invertible, and the epistemological process is precisely the process of exploration of 
the environment [ ]. Similarly, it is the process of managing non-invertible nuisances 
that spawns the opportunity, if not the necessity, of an internal representation. This 
process does not necessarily need to be carried out by the individual, but can instead 

occur through exploration during the evolution process. COULD INTELLIGENT BEHAVIOR HAVE EVOLVED WITHOUT 

VISUAL PERCEPTION^ 

It is the dealing with non-invertible nuisances that forces image analysis. Occlu- 
sions cannot be "undone" in one image. Test and template must be compared. But in 
order to do so efficiently one wants to compute all possible sufficient statistics, so that 
the comparison is reduced to a simple selection of the most discriminative statistics. 
This is the only case where we can find a compelling reason for data analysis. Optical 
sensing and the structure of the environment play a crucial role because of visibility 
artifacts, that cause occlusion of line of sight or cast shadows. 

So we argue that intelligent behavior, that hinges on a "discrete" internal represen- 
tation, stems when an agent or species exhibits the ability to perform decisions based 
on optical sensing, or in any case sensing that entails occlusion phenomena. This would 
not include acoustic sensing 12 or chemical ones. Clearly once the capability (and an 
architecture) to perform signal analysis is evolved, an individual or species can oppor- 
tunistically use it for other purposes, for instance to analyze acoustic or chemical data, 
but that capability might not have evolved if that was the only form of sensing in the 
first place. 



Although the Cochlea does perform some sort of Fourier analysis; sounds compose additively. 
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Appendix A 

Background material 



A.l Basics of differential geometry 

Some of the concepts discussed in this manuscript use concepts from differential ge- 
ometry, including the notions of group, orbit space, equivalence classes, quotients and 
homogeneous spaces. A good introduction to this material is [ ]. While an expository 
review of differential geometry is beyond the scope of this appendix, most of the con- 
cepts treated in this manuscript can be understood after going through Appendix A of 
[140], on pages 403-433 (skipping Section 1.3). Some of that material, specifically re- 
lating to the Lie groups SE(3) and 50(3), and their corresponding Lie algebra se(3), 
can also be found in Chapter 3 of [125]. Our notation in this manuscript is consistent 
with, and in fact derived from, both [ ] and [140]. 

A.2 Basic topology (by Ganesh Sundaramoorthi) 

In this appendix we describe some basic notions of topology that are useful to follow 
the orange-colored sections of the manuscript. We have privileged simplicity to rigor, 
so some of the statements made are imprecise and others are not fully developed. The 
interested reader can consult a differential topology book for details and clarifications. 

A.2.1 Morse Functions 

We consider functions / : R 2 — >> R + as models for images. Define the gradient as 




and define the Hessian as 
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Define a level set of a function at level a £ R + as 



f- 1 (a) = {xeR 2 : f(x) = a}. 



Definition 13 (Critical Point) For a C 1 function, a critical point p £ R 2 is a point 
such that V/(p) = 0. 

• A critical point p is a local minimum if 3 5 > such that f(x) > f(p) for x 
such that \x — p\ < 5. 

• A critical point p is a local maximum if 3 5 > such that f(x) < f(p) for x 
such that \x — p\ < 5. 

• A critical point p is a saddle if it is neither a local minimum or local maximum. 

Definition 14 (Morse Function) A Morse function / is a C 2 function such that all 
critical points are non-degenerate, i.e., ifp is a critical point, then det V 2 f(p) ^ 0. 

Remark 17 By Taylor's Theorem, we see that Morse functions are well approximated 
by quadratic forms around critical points: 



provided that f(p) =0 (if not, set f to f — f(p) ). In particular, this means that Morse 
functions have isolated critical points. 

The previous Remark, leads to the following observation: 

Theorem 10 (Morse Lemma) If f is a Morse function, then for a critical point p of 
f, there is a neighborhood U 3 p and a chart (coordinate change) ip : U C IR 2 — >> 
U C R 2 so that 



where (xi, x 2) = ^(#1, #2) and (#1, #2) £ ^ 2 are the natural arguments of f. 

Remark 18 Morse originally used the previous theorem as the definition of Morse 
functions. This way, a Morse function does not need to be differentiate. 



fix) = Sip) + V/(p) -[x-p) + [x- p) T V 2 fip)ix -p)+ o(\x - p\ 2 ) 
= (x — p) T V 2 f(p)(x — p) + o(\x — p\ 2 ) 




(x\ + x 2) ifp is a maximum 
I + x\ ifp is a minimum 
I — x\ ifp is a saddle 



A.2.2 Examples of Morse/non-Morse Functions 

Examples of Morse/Non-Morse functions are the following: 

• Obviously, f(x\,x<2) = x\ + x\ and f(xi,X2) = x 2 — x\ are Morse Functions. 
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Figure A. 1 : Morse's Lemma states that in a neighborhood of a critical point of a Morse 
function, the level sets are topologically equivalent to one of the three forms (left to 
right: maximum, minimum, and saddle critical point neighborhoods). 

• The height function f : S — >• M 2 of the embedding of a compact surface S C IR 3 
without boundary is a Morse function. For example, the height function of the 
torus: 



• /(xi, #2) = x \ + x 2 i s not a Morse function (degenerate critical point). 

• A monkey saddle, i.e., f(xi, X2) = x\ — Zx\x\\ 



is not a Morse function (degenerate; level sets of saddle must cross at an 4 X'). 

• All non-smooth functions are not Morse functions, e.g. images that have edges! 

• Functions that have co-dimension one critical sets (e.g. ridges and valleys) are 
not Morse functions. Such critical sets are commonplace in images! 

A.2.3 Morse Functions are (Almost) all Functions 

Morse functions seems to be a very restricted class of functions from the previous 
examples, however, a basic result from Morse Theory says otherwise. 

Definition 15 (Dense Subset of a Normed Space) Let || • || denote a norm on a topo- 
logical space X. A set S C X is dense if closure (S) = X, i.e., for every x G X and 
S > 0, there exists s G S such that \\s — x\\ < S. 
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Theorem 11 (Morse Functions are Dense) Let || • || denote the C 2 norm on the space 
of C 2 functions, i.e., 

H/ll = sup \f{x)\ + |V/(z)| + |V 2 /(z)l- 

x£R 2 

Then Morse functions form an open, dense subset of C 2 functions. 

Remark 19 By the previous theorem, any smooth (C 2 ) function can be well approx- 
imated by a Morse function up to arbitrary precision (defined by || • \\). For exam- 
ple, ridges and valleys can be approximated with a Morse function, e.g. consider the 
following circular ridge that can be made Morse by a slight tilt. Let f(x\,X2) = 
exp {—{\/x\ + x 2 — l) 2 ) which is a ridge and non-Morse: 



Now consider the function g(x\, £2) = #2) + £X\, which is arbitrarily close to 
f (in C 2 norm) and is a Morse function: 



It is a basic fact that C 2 functions under the norm above are dense in all square inte- 
grable functions (L 2 ) functions. Therefore, even non- smooth functions can be approx- 
imated to arbitrary precision by Morse functions. 

A.2.4 Reeb Graph 

Definition 16 (Equivalence Relation) Let X be a set. An equivalence relation on X 

denoted ~ is a binary relation with the following properties: for all x,y,z G X: 

• (reflexivity) x ~ x 

• (symmetry) if x ~ y then y ~ x 

• (transitivity) ifx^y and y ~ z, then x ~ z 

We denote by [x] all elements of X related to x, e.g. [x] = {y G X : x ~ y}. 

Definition 17 (Topological Space) A topology denoted T on a set X is a collection 
of subsets of X (called open sets) such that the following properties hold: 

• 0,XgT 

• for U a G T where a G J is an index set (perhaps uncountable), we have 

• for Ui G T where i G X is a finite index set, we have C\ ieX Ui G T. 
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Definition 18 (Quotient Space) Let X be a topological space. Let <~ denote an equiv- 
alence relation on X. The quotient space of X under the equivalence relation ~, 
denoted X/ ~ is the topological space whose elements are 

X/ ~={[ x ] : xeX}, 

and whose topology is induced from X. The quotient map is the (continuous) function 
7r : X X/ ~ defined by ir(x) = [x]. 

Definition 19 (Reeb Graph) Let f : M 2 ^ R be a function. Define a equivalence 
relation ~ on the space Graph(f) = {(#,/(#)): x G R 2 } by 

(x, f{x)) ~ (y, f(y)) Wf(x) = /(y) and there is a continuous path from x to y in f~ 1 (f(x)). 

The Reeb graph of the function f, denoted Reeb(f), is the topological space Graph(f) / ~. 

Remark 20 The Reeb graph of a function f is the set of connected components of level 
sets of f (with the additional information of the function value of each level set). 

A.2.5 Examples 

We will depict the Reeb graph in the following way: an element [(x, f(x))] G Reeb(/) 
will be represented by a point P[( x j( x ))] in the x — y plane, and if f(zi) > f(z 2 ) then 
the ^/-coordinate of p [{zi j {zi) )] will be larger thanp [( ^ 2 j( Z2 ))]. 

• f(x u x 2 ) =x\+x\ 



• f(x u x 2 ) = exp [-{x\ + x 2 2 )\ + exp [-{(xi - l) 2 + x 2 2 )\ 



• f(xi,x 2 ) = exp[-(x 2 +xl)] -0.1exp[-10(Oi - 0.2) 2 + x 2 ,)] 



• f{x, y) = exp [-{x\ + x 2 )] +exp [-(On - 3) 2 + x 2 2 )] +exp [-((a* + 3) 2 + x 2 )] 
A.2.6 Properties of Reeb Graphs 

Lemma 1 (Reeb graph is connected) /// : R 2 4lw a function, then Reeb(f) is 
connected. 

Lemma 2 (Reeb Tree) If f : R 2 — » R w a function, then Reeb(f) does not contain 
cycles. 
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Remark 21 Both of these results follow from basic results in topology, namely, that 
connectedness and contractibility of loops are preserved under quotienting. That is, 
since Graph(f) is connected and loops in Graph(f) are contractible (so long as f is 
continuous), we have that Reeb(f) = Graph(f)/ ~ must also have these properties. 

Assume now that / : R 2 — » R is a Morse function whose critical points have 
distinct values, then we may associate an attributed graph to Reeb(/). 

Definition 20 (Attributed Graph) Let G = (V,E) be a graph (V is the vertex set and 
E is the edge set), and L be a set (called the label set). Let a : V — »• L be a function 
(called the attribute function). We define the attributed graph as AG = (V,E,L,a). 

Definition 21 (Attributed Reeb Tree of a Function) Let V be the set of critical points 
of f. Define E to be 

E = {(vi, Vj) : i ^ j, 3 a continuous map 7 : [0, 1] — » Reeb(f) such that 

7(0) = [(Vij(vi))], 7 (1) = [(vjj(vj))} and>y(t) + [(v J (v))} for all v e V and all t e (0,1)}. 

(A. 

Let L = R+, and 

a(v) = f(v). 

Definition 22 (Degree of a Vertex) Let G = (V, E) be a graph, and v e V, then the 
degree of a vertex, deg(v), is the number of edges that contain v. 

Theorem 12 Let f : R 2 — >• R + be a Morse function with distinct critical values. Let 
( V, R, /) be its Attributed Reeb Tree. Then 

1. (V,E) is a connected tree 

2. no — rt\ + n<2 = 2 where no is the number of maxima, n\ the number of saddles 
and ri2 the number of minima 

3. IfvEV and v is a local minimum/maximum, then deg(v) = 1 

4. Ifv G V and v is a saddle, then deg(v) = 3 

Remark 22 Property 2 above is a remarkable fact from Morse Theory, which is more 
general than it is shown above. Indeed, given a compact surface S C R 3 , the number 
no — rii + ri2 is the same for any Morse function f : S —> R + , i.e., no — ri\ + ri2 
(although seemingly a property of the function) is an invariant of the surface S. 

Remark 23 Using the fact that for any tree (V, E), we have that \V\ — \E\ = 1 and 
Property 2, we can conclude by simple algebraic manipulation that deg(v) = 3 for a 
saddle. 

Theorem 13 (Stability of Attributed Reeb Tree Under Noise) Let f : R 2 R be a 

Morse function and set g £ = f + sh where h : R 2 C 2 . Then for all e sufficiently 

small, g e is Morse and ART(f) = ART(g e ). 
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A.2.7 Diffeomorphisms and the Attributed Reeb Tree 

Definition 23 (Diffeomorphism of the Plane) A function i/j : R 2 — » R 2 is a diffeo- 

morphism provided that Vip(x) and Vijj~ l (x) exists for all x G R 2 . 

Theorem 14 (Invariance of ART Under Diffeomorphisms) Let f : M? —> R be a 

Morse function and ip : R 2 — >• R 2 be a diffeomorphism. Then ART(f) = ART(foip). 

Remark 24 Afote £to z/p w <2 critical point of f, then ^~ l {p) is a critical point of 

V(fo^-\p)) = V^-^p^oVf^-^p))) = V^(p)oV/(p) = ifVfip) = 0. 

Therefore the vertex set in the ART of both f and foip are equivalent. Moreover, ifj is 
a continuous path in f~ 1 (f(x)) then ^07 is a continuous path in (/o?/;) _1 (/o?/;(x)), 
as diffeomorphisms do not break continuous paths. Therefore, the edge sets in the ART 
of f and f o ip are equivalent. 

Theorem 15 If /, g : R 2 — >> R are Morse functions with distinct critical values and 
if ART(f) = ART(g), then there exists a monotone function h : R — >> R a^J a 
diffeomorphism i/j : R 2 — > R 2 smc/i ^/ia^ f = h o g o ip. 

Remark 25 By Morse Lemma, we can construct diffeomorphisms tpi around critical 
points, the idea is then to "stitch" these diffeomorphisms up with "patches" to form 
the diffeomorphism if) of interest. 

Theorem 16 (Reconstruction of Function from ART) If (V, E) is a tree such that 
each vertex v G V is of degree 1 or 3, then there exists a Morse function f : R 2 —> R 
such that ART(f) = (V, E). 

Definition 24 (Orbit Space) Let X be a set, and G be a group. 

• G acts on X if each g G G is also g : X — >> X such that 

1. For each g,h G G and x G X, (gh)x = g(hx). 

2. For the identity element e G G, we have ex = xfor all x G X. 

• IfG acts on X, then the orbit of a point x G X is Gx = {gx : g G G}. 

• Define an equivalence relation in X by x ^ y if there exists g G G such that 
gx = y. The orbit space (or the quotient of the action G) is the set X/G = 

{[x] : x G X}. 

Theorem 17 Let T be the set of Morse functions with distinct critical values, % denote 
the set of monotone functions h : R —> R, and W denote the set of diffeomorphisms of 
the plane. Then 

• MxW acts on T through the action : (/i, w)f = h o / o w for h G %, w G W, 
and f G T. 



174 APPENDIX A. BACKGROUND MATERIAL 

• The orbit space T j(^-L x W) = 2? where 2? is the set of trees whose vertices 
have degree 1 or 3. 

Remark 26 The second result above is simply a restatement of the Theorems above. 
Indeed, we can define the mapping ART : T '/(H x W) — » 2F by 

ART([f]) = ART(f), where [/] = {(h,w)f G T : (ft, w) e 7-L x W} 

The function above is well-defined since by Theorem 14, any representative g G [/] 
will have the same Attributed Reeb Tree. Note 

• Theorem 15 states that ART : T j (H x W) — » ^ w injective. 

• Theorem 16 states that ART : T j (H x W) — » ^ w surjective. 

• Therefore, ART : T '/{% xW) -> 2" is a bijection and therefore, T '/(H x W) = 

sr. 



A.3 Radiometry primer (by Paolo Favaro) 

We describe the light source using its radiance, Rl(Qj I), which indicates the power 
density per unit area and unit solid angle emitted at a point q G L in a given direction 
I G M 2 , and is measured in [W/sterad/m 2 ]. This is a property of the light source. When 
we consider the particular direction I from a point q G L on the light source towards 
a point p G 5 on the scene, this is given by g q ^(p — q) = g q p — = g q p. Therefore, 
given a solid angle dft l and an area element dL on the light source, the power per solid 
angle and unit foreshortened 1 area radiated from a point q towards p is given by 

#l(<7, g q p)dtt L (v q ,g q p)dL (A.2) 

where g q p G M 2 is intended as a unit vector. Now, how big a patch dL of the light we 
see standing at a point p on the scene depends on the solid angle dfls we are looking 
through. Following Figure A.2 we have that 

dL = dQ s \\p-q\\ 2 /(v q ,l qP ) (A3) 

where we have defined l qp = q — p/\\q — p\\ and the inner product at the denominator 
is called foreshortening. Similarly, the solid angle d^L shines a patch of the surface 
dS. The two are related by 

dVtL = I, ^ 112 l pq) ( A - 4 ) 

b-^ll 2 

where l pq = — l qp = p — q/\\p — q\\. Substituting the expressions of dQ^L and dL in the 
previous two equations into (A.2), one obtains the infinitesimal power received at the 
point p. 



l lf the area element on the light source is dL, the portion of the area seen from p is given by (u q , g q p)dL; 
this is called the foreshortened area. 
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Figure A.2: Energy balance: a light source patch dL radiates energy towards a surface 
patch dS. Therefore, the power injected in the solid angle dflj, by dL equals the 
power received by dS in the solid angle dfls- Equation (A.4) expresses this balance in 
symbols. 



Now, we want to write the portion of power exiting the surface at p in the direction 
of a pixel x through an area element dS. First, we need to write the direction of x in 
the local reference frame at p. We assume that x is a unit vector, obtained for instance 
via central perspective projection 

tt : R 3 — > § 2 ; p \-> tt(p) = x. (A.5) 

However, the point p is written in the inertial frame, while x is written in the frame of 
the camera at time t. We need to first transform x to the inertial frame, via g*(t)~ 1 x, 
and then express this in the local frame at p, which yields g p ~ 1 g*(t)~ 1 x. We call the 
normalized version of this vector l px (t). Then, we need to integrate the infinitesimal 
power radiated from all points on the light source through their solid angle dfl l against 
the BRDF 2 , which specifies what portion of the incoming power is reflected towards x. 
This yields the infinitesimal energy that p radiates in the direction of x through an area 
element dS: 

R s (p,x)dS(p) = J p(l px (t), g p q)R L (q, g q p)dVt L {q){v q , g q p)dL(q) (A.6) 

where the arguments in the infinitesimal forms dS, dL^dVtL indicate their dependency. 

2 The following equation, which specifies that the scene radiance is a linear transformation of the scene 
radiance via the BRDF is merely a model, and not something that can be proven. Indeed this equation is 
often used to define the BRDF. 
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Now, we can substitute 3 the expression of dQ l from (A.4) and simplify the area ele- 
ment dS, to obtain the radiance of the surface at p 

R S (p,x)= [ P{l px {t),g v q)R L {q, g q p) f q ' 9q ^ 2 (N p , l pq )dL(q) (A.7) 

Since the norm \\p — q\\ is invariant to Euclidean transformations, we can write it as 
Now, if the size of the scene is small compared to its distance to the light, this 
term is almost constant, and therefore the measure 

dE(q,g qP ) = R L (q,g qP ) dL(q) (A.8) 

can be thought of as a property of the light source. Since we cannot untangle the con- 
tribution of Rl from that of dL, we just choose dE to describe the power distribution 
radiated by the light source. Therefore, we have 

Rs(p,x) = J^f3(l px (t),g p q)(N pj l pq )dE(q,g q p). (A.9) 

This is the portion of power per unit area and unit solid angle radiated from a point p on 
a reflective surface towards a point x on the image at time t. The next step consists of 
quantifying what portion of this energy gets absorbed by the pixel at location x. This 
follows a similar calculation, which we do not report here, and instead refer the reader 
to [81] (page 208). There, it is argued that the irradiance at the pixel x is equal to the 
radiance at the corresponding point p on the scene, up to an approximately constant 
factor, which we lump into Rs. The point p and its projection x onto the image plane 
at time t are related by the equations 

x = 7r(g(t)p) p = g(t)- 1 7Ts 1 (x) (A.10) 

where ir^ 1 : S 2 — » R 3 denotes the inverse projection, which consists in scaling x by its 
depth Z(x) in the current reference frame, which naturally depends on S. Therefore, 
the equation below, known as the irradiance equation, takes the form 

I(x,t) = R s (pM9(t)p)) = Rsigity 1 ^ 1 ^)^). (A.ll) 

After we substitute the expression of the radiance (A.9), we have the imaging equation, 
which we describe in Section B.l, Equation (B.7). 



3 Most often in radiometry one performs the integral above with respect to the solid angle dQs> rather 
than with respect to the light source. For those that want to compare the expression of the radiance Rs 
with that derived in radiometry, it is sufficient to substitute the expressions of dL and dQ^ above, to obtain 
R>s(p, x ) — Je 2 P(ipx(t)i 9pQ)Rl{Qj 9qP)(N p , g q p)dfls(p)- In our context, however, we are interested 
in separating the contribution of the light and the scene, and therefore performing the integral on L is more 
appropriate. 
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B.l Basic image formation 

This section spells out the conditions under which the Ambient-Lambert- Static model 
is a reasonable approximation of the image formation process. 

B.1.1 What is the "image" ... 

An "image" is just an array of positive numbers that measure the intensity (irradiance) 
of light (electromagnetic radiation) incident a number of small regions ("pixels") lo- 
cated on a surface. We will deal with gray- scale images on flat, regular arrays, but 
one can easily extend the reasoning to color or multi- spectral images on curved sur- 
face, for instance omni-directional mirrors. In formulas, a digital image is a function 
/ : [0,N X - 1] x [0,N y - 1] [0,N g - 1]; (x,y) H> I(x,y) for some number of 
horizontal and vertical pixels N x , N y and grey levels N g . For simplicity, we neglect 
quantization in both pixels and gray levels, and assume that the image is given on a 
continuum DcR 2 , with values in the positive reals: 

I : D C M 2 4 R + ; x H> I(x) (B.l) 

where x = [x, y] T G M 2 . When we consider more than one image, we index them 
with t, which may indicate time, or generically an index when images are not captured 
at adjacent time instants: I(x,t). We often use time as a subscript, for convenience of 
notation: It(x). This abstraction in representing images is all we need for the purpose 
of this manuscript. 

B.1.2 What is the "scene"... 

A simple description of the "scene", or the "object", is less straightforward. This is a 
modeling task, for which there is no right or wrong choice, and finding a suitable model 
is as much an art as it is a science; one has to exercise discretion to strike a compromise 
between simplicity and realism. We consider the scene as a collection of "objects" 
that are volumes bounded by closed, piecewise smooth surfaces embedded in R 3 . We 
call the generic surface S, which may for convenience be separated into a number 
i = 1, . . . , N Q of simply connected components, or "objects" S = U^S*. The surface 
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is described relative to a (Euclidean) reference frame, which we call g G SE(3). The 
two entities 

S C M 3 ; g G S£(3) (B.2) 

describe the geometry of the scene, and in particular we call g the pose relative to 
a fixed (or "inertial") reference frame 1 and S the shape of objects, although a more 
proper definition of shape would be the quotient S/g [101]. This is, however, inconse- 
quential as far as our discussion is concerned. 

Objects interact with light in ways that depend upon their material properties. De- 
scribing the interaction of light with matter can be rather complicate if one seeks physi- 
cal realism: one would have to start from Maxwell's equations and describe the scatter- 
ing properties of the volume contained in each object. That is well beyond our scope. 
Besides, we do not seek physical realism, but only to capture the phenomenology of 
the material to the extent in which it can be used for detection, recognition or other 
visual decision tasks. We will therefore start from a much simpler model, one that is 
popular in computer graphics, because it can describe with sufficient accuracy a suf- 
ficient number of real-world objects: each point p on an object S has associated with 



it a function /3 : 



-](v,l) i — y P(v,l) that determines the portion of en- 



ergy 2 coming from a direction I that is reflected in the direction v, each represented as 
a point on the half-sphere M 2 centered at the point p. This is called the bi-directional 
reflectance distribution function (BRDF) and is measured in [1/sterad]. This model 
neglects diffraction, absorption, subsurface scattering; it only describes the reflective 
properties of materials {reflectance). 

To make the notation more precise, we define a local (Euclidean) reference frame, 
centered at the point p with the third axis along the normal to the surface, = N p _L 
T P S and first two axes parallel to the tangent plane. We call such a local reference 




Figure B. 1 : Local reference frame at the point p. 
frame g p , which is described in homogeneous coordinates by 



9 P 



u p N p 



p 
1 



(B.3) 



*If a point p is represented in coordinates via X £ R 3 , then the transformed point gp is represented in 
coordinates via RX + T, where R G SO (3) is a rotation matrix and T £ R 3 is a translation vector. The 
action of SE(3) on a vector is denoted by g*v, so that if the vector v has coordinates V E R 3 , then g*v 
has coordinates RV . See [125], Chapter 2 and Appendix A, for more details. 

2 The term "energy" is used colloquially here to indicate radiance, irradiance, radiant density, power etc. 
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where u p , v p and N p are unit vectors. Therefore, a point q in the inertial reference 
frame will transform to g p q in the local frame at p. Similarly, a vector v in the inertial 
frame will transform to g p v in the local frame where 



9p* 



v p N p ] 




(B.4) 



The total energy radiated by the point p in a direction v is obtained by integrating, of 
all the energy coming from the light source, the portion that is reflected towards v, 
according to the BRDF. The light source is the collection of objects that can radiate 
energy. In principle, every object in the scene can radiate energy (either by reflection 
or by direct radiation), so the light source is just the scene itself, L = S, and the energy 
distribution can be described by a distribution of directional measures on L, which we 
call dE G L\ oc (L x M 2 ), the set of locally integrable distributions on L and the set 
of directions. These include ordinary functions as well as ideal delta measures. The 
distribution dE depends on the properties of the light source, which is described by 
a function Rl : L x H 2 — » R of the point q on the light source and a direction (see 
Appendix A. 3 for the relationship between dE and Rl). The collection 

/?(-, •) : tf x tf 4 R + ; L and dE : L x H 2 -> R + (B.5) 

describes the photometry of the scene (reflectance and illumination). Note that f3 
depends on the point p on the surface, and we are imposing no restrictions on such 
a dependency. For instance, we do not assume that f3 is constant with respect to p 
(homogeneous material). When emphasizing such a dependency we write /3(v,l;p). 

In addition, reflectance (BRDF) and geometry (shape and pose) are properties of 
each object that can change over time. So, in principle, we would want to allow fi, S, g 
to be functions of time. In practice, we assume that the material of each object does 
not change, but only its shape, pose and of course illumination. Therefore, we will use 

S = S(t); g = g(t), te [0,T] (B.6) 

to describe the dynamics of the scene. The index t can be thought of as time, in case a 
sequence of measurements is taken at adjacent instants or continuously in time, or it can 
be thought of as an index if disparate measurements are taken under varying conditions 
(shape and pose). We often indicate the index t as a subscript, for convenience: St = 
S(t); g t = g(t). Note that, as we mentioned, the light source (L, dE) can also change 
over time. When emphasizing such a dependency we write L(t) and dE(q, l;t), or 
L u dE t (l). 

Example 8 The simplest surface S one can conceive of is a plane: S = {p G R 3 | (iV, p) 
d] where N is the unit normal to the plane, and d is its distance to the origin. For a 
plane not intersecting the origin, 1 jd can he lumped into N, and therefore three num- 
bers are sufficient to completely describe the surface in the inertial reference frame. In 
that case we simply have S a constant, and g = e, the identity. A simple light source 
is an ideal point source, which can be modeled as L G R 3 with infinite power density 
dE = EiS(q — L)dL(q). Another common model is a constant ambient illumination, 
which can be modeled as a sphere L = S 2 with dE = E^dL. We will discuss examples 
of various models for the BRDF shortly. 
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Remark 27 (Choosing a level of granularity in the representation) Note that by as- 
suming that the world is made of surfaces we are already imposing significant restric- 
tions, and we are implicitly choosing a level of granularity for our representation. 
Consider for instance the fabric shown in Figure B.2. There is no surface there. The 
fabric is made of thin one -dimensional threads, just woven tightly enough to give the 
impression of spatial continuity. Therefore, we choose to represent them as a smooth 
surface. Of course, the variation in the appearance due to the fine -scale structure of 
the threads has to be captured somehow, and we delegate this task to the reflectance 
model. Naturally, one could even describe each individual thread as a cylindrical sur- 
face modeled as an object S, but this is well beyond the level of detail that we want 
to capture. This example illustrates the fact that describing objects entails a notion 
of scale. Something (e.g. a thread) is an object at one scale, but is merely part of a 
texture at a coarser scale. Figure B.2 highlights the modeling tradeoff between shape 
and reflectance: one could model the fabric as a very complex object (woven thread) 
made of homogeneous material (wool), or as a relatively simple object (a smooth sur- 
face) made of textured material. Although physically different, these two scenarios are 
phenomenologically indistinguishable, which relates to the discussion earlier in the 
manuscript of the role between the light field and the complete representation. 




Figure B.2: A complex shape (woven thread) with simple reflectance (homogeneous 
material), or a simple shape (a smooth surface) with complex reflectance (texture)? 



Remark 28 (Tradeoff between shape and motion) As we have already noted, instead 
of allowing the surface S to deform arbitrarily in time via S(t), and moving rigidly in 
space via g(t) £ SE(3), we can lump the motion and deformation into g(t) by allowing 
it to belong to a more general class of deformations G, for instance diffeomorphisms, 
and let S be constant. Alternatively, we can lump the deformation g(t) into S and just 
describe the surface in the inertial reference frame via S(t). This can be done with 
no loss of generality, and it reflects a fundamental tradeoff in modeling the interplay 
between shape and motion [173]. 
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Now, if we agree that a scene can be described by its geometry, photometry and dy- 
namics, we must decide how these relate to the measured images. 



B.1.3 And how are the two related? 

Given a description of the geometry, photometry and dynamics of a scene, a model 
of the image is obtained through a description of the imaging device. An imaging 
device is a series of elements designed to control light propagation. This is typically 
modeled through diffraction, reflection, and refraction. We ignore the first two, and 
only consider the effects of refraction. For simplicity, we can also assume that the set 
of objects that act as light sources and those that act as light sinks are disjoint, so that 
S D L = 0, i.e. we ignore inter-reflections. Note that S needs not be simply connected, 
so we can divide simply connected regions of the scene into "light" L and "objects" 
Si, i = 1, . . . ,iV . 

Now, using the notation introduced in the previous section, we want to determine 
the energy that impinges on a given pixel as a function of the shape of the scene S, 
its BRDF /3, the light source L and its energy distribution dE, and the position and 
orientation of the camera. For simplicity, given the tradeoff between shape and motion 
discussed in Remark 28, we describe the (possibly time-varying) shape of the scene 
in the inertial frame and drop the explicit description of its pose. In fact, to further 
simplify the notation, we can choose the inertial frame to coincide with the position 
and orientation of the viewer at time t = 0, so that if Iq(xq) is the first image, then the 
scene can be described as a surface parameterized by xq\ S t (xo). We then describe the 
position and orientation of the camera at time t relative to the camera at time using a 
moving Euclidean reference frame g t G SE(3). Following the derivation in Appendix 
A. 3, the intensity (irradiance) measured at a pixel x on the image indexed by t is given 
by 



' ' I t {x) = f L P(lp X (t),g p q)(N p ,l pq )dE(q,g q p)] 
x = 7r(g t p)] peS 



The Imaging Equation 

(B.7) 



where the symbols above are defined as follows: 



Directions: In the equation above, we have defined l px = g p ~ 1 g*(t)~ 1 x, g p and g p ^ 
are defined by Equation (B.3) and (B.4) respectively, l pq = p — q/\\p — q\\ and 
g p q indicates the (normalized) direction from p to q, and similarly for g q p\ 

Light source: L c M 3 is the (possibly time- varying) collection of light sources emit- 
ting energy with a distribution dE : L x M 2 — >> R + at every point q G L towards 
the direction of a point p on the 

Scene: a collection of (possibly time- varying) piecewise smooth surfaces S C M 3 ; 
f) : M 2 x H 2 x S — » R is the bi-directional reflectance distribution function 
(BRDF) that depends on the incident direction, the reflected direction and the 
point p G S on the scene S and is a property of its material. 
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Motion: relative motion between the scene and the camera is described by the motion 
of the camera g t G SE(3) and possibly the action of a more complex group G, 
or simply by allowing the surface St to change over time. 

Projection: tt : R 3 i— » S 2 denotes ideal (pinhole) perspective projection, modeled here 
as projection onto the unit sphere, although the same model applies if tt : R 3 — » 
P 2 , in which case l px has to be normalized accordingly. 

Visibility and cast shadows: One should also add to the equation two characteristic 
function terms: Xv(%, i) outside the integral, which models the visibility of the 
scene from the pixel x, and Xs(p, o) inside the integral to model the visibility of 
the light source from a scene point (cast shadows). We are omitting these terms 
here for simplicity. However, in some cases that we discuss in the next section, 
discontinuities due to visibility or cast shadows can be the only source of visual 
information. 

Remark 29 (A philosophical aside on scene modeling) One could argue that the real 
world cannot be captured by simple mathematical models of the type just described, 
and even classical physics is largely inadequate for the task. However, we are not look- 
ing for an absolute model. Instead, we are looking to describe the scene at the level 
of granularity that is suitable for us to be able to perform inference and accomplish 
certain tasks. So, what is the " right" granularity? For us a suitable model of the scene 
is one that can be validated with other existing sensing modalities, for instance touch. 
This is well illustrated by the fabric of Figure B.2, where at the level of granularity re- 
quired the scene can be safely described as a smooth surface. Notice that this is similar 
to what other researchers have suggested by describing the scene as a functional that 
cannot be directly measured. However, such afunctional can be evaluated with various 
test-functions. Physical instruments provide a set of test functions, and imaging device 
provide yet another set of test functions. The goal of the imaging model, therefore, can 
be thought of as relating the value of the scene functional obtained by probing with 
physical instruments to the value obtained by probing with images. 

The imaging equation is relevant because most of computer vision is about invert- 
ing it; that is, inferring properties of the scene (shape, material, motion) regardless 
of pose, illumination and other nuisances (the visual reconstruction problem). How- 
ever, in the general formulation above, one cannot infer photometry, geometry and 
dynamics from images alone. Therefore, we are interested in deriving a model that 
strikes a balance between tractability (i.e. it should contain only parameters that can 
be identified) and realism (i.e. it should capture the phenomenology of image for- 
mation). We will use simple models that are widely used in computer graphics to 
generate realistic, albeit non-perfect, images: Phong (corrected) [149], Ward [201] and 
Torrance- Sparrow (simplified) [187]. All these models include a function pd(p) called 
(diffuse) albedo, and a function p s (p) called specular albedo. Diffuse albedo is of- 
ten called just albedo, or, improperly, texture. In the next section we discuss various 
special cases of the imaging equation and the role they play in visual reconstruction. 
Here we limit ourselves to deriving the model under a generic 3 illumination consist- 



3 See Theorem 18. 
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ing of an ambient term and a number of concentrated point light sources at infinity: 

L = § 2 U {Li, L 2 , . . . , G M 3 , d£(g) = £ d£(g) + ELi - ^ 

this case the imaging equation reduces to 



h{x) = Pd (p) (eo + E- =1 W P , U)) + P .(p) Ei ^ igL ^g|P 

x = n(g t p); p = S(x ). 

(B.8) 



Remark 30 Afote £to model does not explicitly include occlusions and shadows. 
Also, note that the first (diffuse) term does not depend on the viewpoint g, whereas the 
second term (specular) does. However, note that, depending on the coefficient c, the 
second term is only relevant when x is close to the specular direction and, therefore, 
if one assumes that the light sources are concentrated, the second term is relevant in 
a small subset of the scene. If we threshold the effects of the second term based on 
the angle between the viewing and the specular direction, then we can write the above 
model as 



L{x) 



f pd(p)(E + TH =1 E i (N p ,L i )) if (g^x + Lt/WLilNp) < 7 (c)Vi 

k p s (p) otherwise 

(B.9) 

where i = arg min^ x+lj\\l z \\ n p ) wn { cn justifies the rank-based model of [91 ]. 

\9t x ^p) 

Empirical evaluation of the validity of this model, and the resulting "brightness con- 
stancy constraint," discussed in the next subsection, has not been thoroughly addressed 
in the literature. 

The "identity" of a scene or an object is specified by its shape S and its reflectance 
properties f3. The illumination L{t),dE(-,t), visibility t) and pose/deformation 
g(t) are "nuisance factors" that affect the measurements but not the identity of the 
scene/object 4 . They change with the view, whereas the identity of the object does not. 
In the imaging equation (B.8) we measure It(x) for all x G D and t = t±, £2, • • • , t m , 
and the unknowns are L (tj ) , dE ( m ,tj),g(tj), which for simplicity we indicate as Lj , dEj (-),gj 
respectively, for all j = 1, . . . , m. For simplicity, we indicate all the unknowns of in- 
terest with the symbol £ (note that some unknowns are infinite-dimensional), and all 
the nuisance variables with v. Equation (B.8), once we write the coordinates of the 
point p relative to the pixel in the moving frame, p = g(t)~ 1 7r~ 1 (x^ t), can then be 
written as a functional h, formally, as follows: 5 



I = /i(£, v) + n. The Imaging Equation Lite (B.10) 



4 Depending on the problem at hand, some unknowns may play either role: motion, for instance, could 
be a quantity of interest in tracking, but it is a nuisance in recognition. Illumination will almost always be a 



nuisance. 

5 



Note that the symbol v for "nuisance" in the symbolic equation may be confused with N p , the normal 
to the surface in the physical model. Since the two symbols will be used exclusively in different contexts, it 
should be clear which one we are referring to. 
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where, to summarize the equivalence of (B.10) with (B.8), we have 

7:DcR 2 ^M 3 

f G C(R 2 \V R 3 ) x W(8 2 x § 2 -> R+) = S 
v G SE(3) x BV(R 3 -> R+) x P(R 3 R 2 ) = 1/ 
' ft : R 3 x BV(R 3 -> R+) x #£(3) x R+ x R 3/c x R fc R+ (B ' U) 
(p,^,^,E ,{Xi,...,L fc },{Ei,...,^ fc }) ^ J 

where P is a subset of measure zero (the set of discontinuities), BV denotes functions 
of bounded variation. We will use the symbolic notation of (B.10) and the explicit 
notation of (B.8) interchangeably, depending on convenience. In some cases we may 
indicate the arguments of the functions /, £, z/, n explicitly. 

Remark 31 (Occlusions and cast shadows) Occlusions are an accident of image for- 
mation that significantly complicates our modeling efforts. In fact, while they are "nui- 
sances " in the sense that they do not depend solely on the scene, the do depend on 
both the scene and the viewpoint (for occlusions) and illumination (for cast shadows). 
That is why, despite depending on the nuisance, under suitable conditions they can 
be exploited to infer the shape of the scene (see [211] for occlusions and [25] for cast 
shadows). For the case of illumination, as we will show, there is no loss of generality in 
assuming ambient + point-light illumination, at which point cast shadows are simple to 
model as a selection process of what sources are visible from each point. Nevertheless, 
it is a global inference problem that requires a global solution. 



B.2 Special cases of the imaging equation and their role 
in visual reconstruction (taxonomy) 

In its general formulation above, the imaging equation cannot be inverted. Therefore, 
it is common to make assumptions on some of the unknowns in order to recover the 
others. In this section we aim at enumerating a collection of special cases that com- 
pounded characterize most of what can be done in visual inference. We start with 
models of reflection. 

Many common materials can be fruitfully described by a BRDF. Exceptions include 
translucent materials (e.g. skin), anisotropic material (e.g. brushed aluminum), micro- 
structured material (e.g. hair) etc. However, since our goal is not realism in a physi- 
cal simulation, we are content with some common BRDF that are well established in 
computer graphics: Phong (corrected) [149], Ward [201] and Torrance- Sparrow (sim- 
plified) [187]. 

Phong (corrected) /3(v,l) — Pd(p) + Ps(p) cosC V cos ®% cos #o- 

Here cos S — (g(t)~ 1 x + N p ) where each term in the inner product is 

normalized, and Oi = arccos(/, N p ), and arccos(0 o ) = (v,N p ); c G M is a 
coefficient that depends on the material. 
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Ward P(v,l)=p d (p)+p a (p) 



exp(- tan 2 (5)/a 2 ) 



\/cos 9i cos 9 Q 

Here a G M is a coefficient that depends on the material and is determined 
empirically. 



Torrance-Sparrow (simplified) f3{v,l) = Pd(p) + Ps(p) 



exp(-£7cQ 
cos 6>j cos o ' 



Separable radiance As Nayar and coworkers point out [ ], the radiance for the lat- 
ter model can be written as the sum of products, where the first factor depends 
solely on material (diffuse and specular albedo), whereas the second factor com- 
pounds shape, pose and illumination. 

In all these cases, Pd(p) is an unknown function called (diffuse) albedo, and p s (p) is an 
unknown function called specular albedo. Diffuse albedo is often called just albedo, 
or, improperly, texture. 

Note that the first term (diffuse reflectance) is the same in all three models. The 
second term (specular reflectance) is different. Surfaces whose reflectance is captured 
by the first term are called Lambertian, and are by far the most studied in computer 
vision. 



B.2.1 Lambertian reflection 

Lambertian surfaces essentially look the same regardless of the viewpoint: (3(v,l) = 
P(w,l) yw G H 2 . This yields major simplifications of the image formation model. 
Moreover, in the case of constant illumination, it allows relating different views of 
the same scene to one another directly, bypassing the image formation model. This is 
known as the local correspondence problem, which relies crucially on the Lambertian 
assumption and the resulting brightness constancy constraint. 6 We address this case 
first. 



Constant illumination 

In this case we have L(t) = L and dE(q, l\t) = dE(q, I). We consider two simple 
light source models first. 

Self-luminous 

Ambient light is due to inter-reflection between different surfaces in the scene. Since 
modeling such inter-reflections is quite complicated, 7 we will approximate it by as- 
suming that there is a constant amount of energy that "floods" the ambient space. 
This can be approximated by a (half) sphere radiating constant energy: L = S 2 and 
dE = EodL. In this case, the imaging equation reduces to 

J(M) = Pd(p)E [ (N p ,l)d£l(l) = E(p). (B.12) 



6 Although the constraint is often used locally to approximate surfaces that are not Lambertian. 

7 There is some admittedly sketchy evidence that inter-reflections are not perceptually salient [57]. 
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Due to the symmetry of the light source, assuming there are no shadows and having 
a full sphere, we can always change the global reference frame so that N p = e^. 
However, it is only to first approximation (for convex objects) that the integral does 
not depend on p, i.e. we can neglect vignetting. In this case, Eo can be lumped into 
Pd, yielding the simplest possible model that, when written with respect to a moving 
camera, gives 



Note that this model effectively neglects illumination, for one can think of a scene S 
that is self-luminous, and radiates an equal amount of energy p(p) in all directions. 
Even for such a simple model, however, performing visual inference is non-trivial. It 
has been done for a number of special cases: 

Constant albedo: silhouettes When p(p) is constant, the only information in Equa- 
tion (B.13) is at the discontinuities between x = 7r(g(i)p),p G S andp ^ S, i.e. 
at the occluding boundaries. Given suitable conditions, that have been first stud- 
ied by Astrom et al. [5], motion g(t) and shape S can be recovered. The recon- 
struction of shape S and albedo p has been addressed in an infinite-dimensional 
optimization framework by Yezzi and Soatto [209, 212] in their work on stereo- 
scopic segmentation. 

Smooth albedo The stereoscopic segmentation framework has been extended to allow 
the albedo to be smooth, rather than constant. The algorithm in [92] provides an 
estimate of the shape of the scene S as well as its albedo p(p) given its motion 
relative to the viewer, g(t). 

Piecewise constant/piecewise smooth albedo The same framework has been recently 
extended to allow the albedo to be piecewise constant in [89]. This amount to 
performing region-based segmentation a' la Mumford-Shah [139] on the scene 
surface S. Although it has not been done yet, the same ideas could be extended 
to piecewise smooth albedo. 

Nowhere constant albedo When V ' p(p) ^ everywhere in p, the image formation 
model can be bypassed altogether, leading to the so-called correspondence prob- 
lem which we will see shortly. This is at the base of most traditional stereo 
reconstruction algorithms and structure from motion. Since these techniques ap- 
ply without regard to the illumination, we will address this after having relaxed 
our assumptions on illumination. 

Point light(s) 

A countable number of stationary point light sources can be modeled as L = {Li,Z/2,. Li G 
R 3 , dE = Yli=i Ei^iQ ~ Li). In this case the imaging equation reduces to 




(B.13) 
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Note that, if we neglect occlusions and cast shadows, 8 the sum can be taken inside the 
inner product and therefore there is no loss of generality in assuming that there is only 
one light source. If the light sources are at infinity, p can be dropped from the inner 
product; furthermore, the intensity of the source E multiplies the light direction, so 
the two can be lumped into the vector L. We can therefore further simplify the above 
model to yield, taking into account camera motion, 



Inference from this model has been addressed for the following cases. 

Constant albedo Yuille et al. [215] have shown that given enough viewpoints and 
lighting positions one can reconstruct the shape of the scene. Jin et al. [ ] have 
proposed an algorithm for doing so, which estimates shape, albedo and position 
of the light source in a variational optimization framework. If the position of the 
light source is known and there is no camera motion, this problem reduces to 
classical shape from shading [80]. 

Smooth/piecewise smooth albedo In this case, one can easily show that albedo and 
light source cannot be recovered since there are always combinations of the two 
that generate the same images. However, under suitable conditions shape can 
still be estimated, as we discuss next. 

Nowhere constant radiance If the combination of albedo and the cosine term (the 
inner product in (B.15)) result in a radiance function that has non-zero gradi- 
ent, we can think of the radiance as an albedo under ambient illumination, and 
therefore this case reduces to multi-view stereo, which we will discuss shortly. 
Naturally, in this case we cannot disentangle reflectance from illumination, but 
under suitable conditions we can still reconstruct the shape of the scene, as we 
discuss shortly in the context of the correspondence problem. 

Cast shadows If the visibility terms are included, under suitable conditions about the 
shape of the object and the number and nature of light sources, one can recon- 
struct an approximation of the shape of the scene. 

General light distribution: the reflectance/illumination ambiguity 

As we have already discussed, in the absence of mutual illumination the light source L 
and the scene S are disjoint. We make the assumption that the light source is "far," i.e. 
the minimum distance between L and S is much smaller than the maximum distance 
between two points in S, mm pe s, q eL d(p, q) » max p ^ e s d(p, r). Under these as- 
sumptions, we can approximate the light source with a half- sphere with infinite radius, 
L = H 2 (oo). The radiance of the light source therefore only depends on the posi- 
tion q G L, but not on the direction, since the latter is always normal to L\ therefore, 

8 Cast shadows for the case of point light sources is simply modeled as a selection process to determine 
which source is visible from which point. 




(B.15) 
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dE : L — » R + ; g dE(q) > 0. An image is obtained by integrating the light source 
against the BRDF: 



I(x)= [ p p (x,\)dE(\)= [ f3 p (v px ,\)e(\)dS 



(B.16) 



where dS is the area form of the sphere, and we neglect edge effects. For nota- 
tional simplicity we write /3 p (x, A) as a short-hand for f3 p (is pxj v p \). Clearly, if we 
call /3 p (x,X) = /3 p (x, A)ft(A) and e(A) = h~ 1 (X)e(X), by substituting /3 and e in 
(B.16), we obtain the same image, and therefore illumination and reflectance can only 
be determined up to an invertible function h that preserves the positivity, reciprocity 
and losslessness properties of /3. 

To reduce this ambiguity, we show that, under the assumptions outlined above, 
there is no loss of generality in assuming that the illumination is a constant (ambient) 
term and a collection of ideal point light sources. In fact, Wiener ("closure theorem," 
[203] page 100) showed that one can approximate a positive function in L 1 or L 2 on 
the plane arbitrarily well with a sum of isotropic Gaussians: lim/v^oo YliLo ^iG{x — 
fjLil al), where G(x — fi; E) ~ exp(— (x — /i) T E(x — //)). Wilson [ ] has shown that 
this can be done with positive coefficients, using Gaussians with arbitrary covariance, 
i.e. Y^f=o EiG( x ~ Mi5 ^i) 5 Ei ^ 0- O ne could combine the two results by showing 
that for any e there exists a a = cr(e) > and an integer N = N(e) such that an 
anisotropic Gaussian with covariance E can be approximated arbitrarily well with a 
sum of N isotropic Gaussians with covariance al, i.e. \\G(x -/*;£)- £f=o £iG(x - 
Mi; <J -^)ll ^ e - Then one could adapt this result to the sphere by showing that the 
so-called angular Gaussian density approximates arbitrarily well the Langevin density 
(minimum entropy density on the sphere, also known as VonMises-Fisher (VMF), or 
Gibbs density on the sphere) G S (X — /x;S) = exp(trace(S/x T A)) where £ § 2 
and S = S T > is the dispersion. The notation A — /i in the argument of G s should 
be intended as the angle between A and [i on the sphere. Finally, one can attribute the 
kernel G s to the BRDF /3 instead of the light source. Neglecting the second argument 
in E) and neglecting visibility effects, we have: 



= f p p (x,X)J2EiG 3 (X-fH)dS= (B.17) 
J $ 2 i= o 

= Y Ei [ /3 p (x, A) / G S (X')S(X -m- X')dS(X')dS(X) = 

N 

= y^Et p p {x, X' + X")G S (X')5(X" - vn)dS(\')dS(\") = 

r N 

= / P p (x,X")J2Ei5(X" -Hi)dS (B.18) 

where (3 p (x, A) = / g2 (3(x, X + X')G(X')dS(X') and therefore one cannot distinguish 
viewed under point-light sources located at /ii, . . . , /ijv from (3 viewed under the 
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general illumination dE. With some effort, following the trace above, one could prove 
the following result: 

Theorem 18 (Gaussian light) Given a scene viewed under distant illumination, with 
no self -reflections, there is no loss of generality in assuming that the light source con- 
sists in an ambient (constant) illumination term Eq and a number N of point-light 
sources located at p,\, . . . , iijsr each emitting energy with intensity Ei > 0. 

Given these considerations, we restrict our attentions to illumination models that 
consist of the sum of a constant ambient term and a countable number of point light 
sources. The general case, therefore, reduces to the special cases seen above: 



Note that the energy does not depend on the direction, since for distant lights (sphere 
of infinite radius) all directions pointing towards the scene are normal to L. 

Remark 32 (Spherical harmonics) Note that current work on general representa- 
tions of illumination uses a series expansion of the distribution dE on L = S 2 into 
spherical harmonics [154]. While this is appropriate for simulation, in the context of 
inference this is problematic for two reasons: first, spherical harmonics are global, so 
the introduction of another term in the series affects the entire image. Second, while 
any smooth function on the sphere can be approximated with spherical harmonics, 
there is no guarantee that such a function be positive, hence physically plausible. In- 
deed, the harmonic terms in the series are themselves not positive, and therefore each 
individual component does not lend itself to be interpreted as a valid illumination, and 
there is no guarantee except in the limit where the number of terms goes to infinity 
that the truncated series will be a valid illumination. The advantage of a sum of Gaus- 
sian approximation is that one can approximate any positive function, and given any 
truncation of the series one is guaranteed to have a positive distribution dE. 

Remark 33 (Local discrete kernels) An alternative to using Gaussians is to use sim- 
ple functions defined on a tiling of the sphere. For such functions to be translation- 
invariants, however, the sphere would have to be tiled in regular subdivisions, and this 
is known to be impossible, as it would entail the existence of regular polygons with an 
arbitrary number of faces. The same holds for discrete approximations of wavelets on 
the sphere. 

Remark 34 (Illumination variability of a Lambertian plane) Consider an image gen- 
erated by a model (B.9). We are interested in modeling the variability induced in two 
images of the same scene under different illumination. We will assume that illumi- 
nation can be approximated by an ambient term Eo and a concentrated point source 
with intensity E\ located at L, so that each image Ii{xj) can be approximated by 
Pd(p)(Eo(ti) + Ei(ti)(N p , L(ti)) + /3(i) where the latter term lumps together the 
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(B.19) 
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effects of non-Lamb ertian reflection. This neglects vignetting (eq. (B.12)). The rela- 
tionship between two images, the, can be obtained by eliminating the diffuse reflection 
Pd, so as to obtain 

i(x 1 ,t l) { x 2 , t 2) + Ei{h){N ^ L[h)) Eo{h) + El{h){Npj L{h)) 

Now, if the scene is a plane, then the first fraction on the right hand side does not 
depend on p, i.e. it is a constant, say a. The second and third term depend on p if the 
scene is non-Lamb ertian. However, if non-Lamb ertian effects are negligible (i.e. away 
from the specular lobe), or altogether absent like in our assumptions, then the second 
term can also be approximated by a constant, say ft. Furthermore, for the case of a 
plane x\ and x 2 are related by a homography, x\ = Hx2 where x\ and x 2 are intended 
in homogeneous coordinates. Therefore, the relationship between the two images can 
be expressed as 

I(x2,t 2 )=aI(Hx2,t 2 )+p. (B.20) 

One can therefore think of one of the images (e.g. /(•, to) = p) as the scene, and the 
images are obtained by a warping H of the domain and a scaling a and offset ft of 
the range. All the nuisances, H,a,fi are invertible, and therefore a planar Lamb ertian 
scene one can construct a complete invariant descriptor. 

Multi-view stereo and the correspondence problem 

If the radiance of the scene Rs (p) is not constant, under suitable conditions one can 
do away with the image formation model altogether. Consider in fact the irradiance 
equation (A. 11). Under the Lambertian assumption, given (at least) two viewpoints, 
indexed by t\ and t 2 , we have that 

I(x u h) = R s (pM9(ti)p)) = I(x 2 M) (B.21) 

without regard to how the radiance Rs comes to existence. The relationship between 
x\ and x 2 depends solely on the shape of the scene S and the relative motion of the 
camera between the two time instants, g\ 2 = git^gfo) -1 : 

xi =7r(g 12 7Tg 1 (x 2 )) = w(x 2 ;S,g 12 ). (B.22) 

Therefore, one can forget about how the images are generated, and simply look for the 
function w that satisfies (substitute the last equation into the previous one) 



I(w(x 2 ;S,g 12 ),t 1 ) = I(x 2 ,t 2 ) . (B.23) 



Finding the function w from the above equation is known as the correspondence prob- 
lem, and the equation above is the brightness constancy constraint. 

More recently, Faugeras and Keriven have cast the problem of stereo reconstruc- 
tion in an infinite-dimensional optimization framework, where the equation above is 
integrated over the entire image, rather than just in a neighborhood of feature points, 
and the correspondence function w is estimated implicitly by estimating the shape of 



B.2. SPECIAL CASES OF THE IMAGING EQUATION AND THEIR ROLE IN VISUAL RECONSTRUCTION (TAXONOk 



the scene S, with a given motion g. This works even if p is constant, but due to a 
non-uniform light and the presence of the Lambertian cosine term (the inner product 
in equation (B.15)) the radiance of the surface is nowhere constant (shading effect, or 
attached shadow) and even in the case of cast shadows, if the light does not move. In 
the presence of regions of constant radiance, the algorithm interpolates in ways that 
depend upon the regularization term used in the infinite-dimensional optimization (see 
[ ] for more details). 

Constant viewpoint: photometric stereo 

When the viewpoint is fixed, but the light changes, inverting the model above is known 
as photometric stereo [81]. If the light configuration is not known and is allowed to 
change between views, Belhumeur and coworkers have shown that this problem cannot 
be solved [18]. In particular, given two images one can pick a surface S at will, and 
construct two light distributions that generate the given images, even if the scene is 
known to be Lambertian. However, this result relies on the presence of a single point 
light source. We conjecture that if the illumination is allowed to contain an ambient 
term, these results do not apply, and therefore reconstruction could be achieved. Note 
that psychophysical experiments suggest that face recognition is extremely hard for 
humans under a point light source, whereas a more complex illumination term greatly 
facilitates the task. 

B.2.2 Non-Lambertian reflection 

In this subsection we relax the assumption on reflectance. While, contrary to intuition, 
a more complex reflectance model can in some cases facilitate recognition, in general 
it is not possible to disentangle the effects of shape, reflectance and illumination. We 
start by making assumptions that follow the taxonomy used for the Lambertian case in 
the previous subsection. 

Constant illumination 
Ambient light 

In the presence of ambient illumination, the specular term of an empirical reflection 
model, for instance Phong's, takes the form 



If the exponent c — » oo, only one point on the light surface S 2 contributes to the 
radiance emitted from the point p. Since the distribution dE is uniform on L, we 
conclude that, if we exclude occlusions and cast shadows, this term is a constant. This 
can be considered as a limit argument to the conjecture that, in the presence of ambient 
illumination, the specular term is negligible compared to the diffuse albedo. Naturally, 
if an object is perfectly specular, it renders the viewer an image of the light source, so 
in this case inter-reflection is the dominant contribution, and the ambient illumination 
approximation is no longer justified. See for instance Figure B.3. 




(B.24) 
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Figure B.3: In the presence of strongly specular materials, the image is essentially a 
distorted version of the light source. In this case, modeling inter-reflections with an 
ambient illumination term is inadequate. 

Point light(s) 

In the presence of point light sources, the specular component of the Phong models 
becomes 



where the arguments of the inner products are normalized. In this case, assuming that 
a portion of the scene is Lambertian and therefore motion and shape can be recovered, 
one can invert the equation above to estimate the position and intensity of the light 
sources. This is called "inverse global illumination" and was addressed by Yu and Ma- 
lik [214]. If the scene is dominantly specular, so no correspondence can be established 
from image to image, we are not aware of any general result that describes under what 
condition shape, motion and illumination can be recovered. Savarese and Perona [162] 
study the case when assumptions on the position and density of the light, such as the 
presence of straight edges at known position, can be exploited to recover shape. 



General light 

In general, one cannot separate reflectance properties of the scene with distribution 
properties of the light sources. Jin et al. [91] showed that one can recover shape 
S as well as the radiance of the scene, which mixes the effects of reflectance and 
illumination. 




(B.25) 
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Constant viewpoint 

In the presence of multiple point light sources, Many have studied the conditions under 
which one can recover the position and intensity of the light sources, see for instance 
[142] and references therein. Variations of photometric stereo have also been developed 
for this case, starting from [83]. 

Reciprocal viewpoint and light source 

Zickler et al. [218] have developed techniques to exploit a very peculiar imaging setup 
where a point light source and the camera are switched in pairs of images, which allows 
us to eliminate the BRDF from the imaging equation. 

B.3 Analysis of the ambiguities 

The optimal design of invariant feature requires an analysis of the ambiguities in shape, 
motion (deformation), reflectance and illumination. While special cases of this analysis 
have been presented in the past, especially in the field of reconstruction (shape/motion 
[210], reflectance/illumination [141], shape/reflectance [ ], exemplified below), to 
this date there is no comprehensive analysis of the ambiguities in shape, motion, re- 
flectance and illumination. 

Example 9 (Tradeoff between shape and motion) We note that, instead of allowing 
the surface S to deform arbitrarily in time via S(t), and moving rigidly in space via 
g(t) G SE(3), we can lump the motion and deformation into g(t) by allowing it to 
belong to a more general class of deformations G, for instance diffeomorphisms, and 
let S be constant. Alternatively, we can lump the deformation g(t) into S and just 
describe the surface in the inertia! reference frame via S(t). This can be done with 
no loss of generality, and it reflects a fundamental tradeoff in modeling the interplay 
between shape and motion [173]. 

The reflectance/illumination ambiguity has been addressed in Section B.2.1. Given 
the conclusions reached there, we restrict our attention to illumination models that 
consist of the sum of a constant ambient term and a countable number of point light 
sources. 

Even under these restrictive modeling assumption, it can be shown that illumination- 
invariant statistics do not exist. 

Theorem 19 (Non-existence of single-view illumination invariants) There exists no 
discriminative illumination invariant for illumination under the Lambert-Ambient model. 

This theorem was first proved by Chen et al. in [ ] (although also see [166], as cited 
by Zhou, Chellappa and Jacobs in ECCV 2004). Here we give a simplified proof that 
does not involved partial differential equations, but just simple linear algebra. We refer 
to the model (B.8) where we restrict the scene to be Lambertian, p s = 0: If illumina- 
tion invariants do not exist for Lambertian scenes, they obviously do not exist for more 
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general reflectance models. Also, we neglect the effects of occlusions and cast shad- 
ows: If an illumination invariant does not exist without cast shadows, it obviously does 
not exist in the presence of cast shadows, since these are generated by the illumination. 
We neglect occlusions in the sense that we consider equivalent all scenes that are equal 
in their visible range. 

Proof: Without loss of generality, following Theorem 18, we consider the illumi- 
nation to be the superposition of an ambient term {S 2 , Eq} and a number of point light 
sources {Z^, Rather than representing each light with a direction Li G S 2 

and an intensity Ei G M + , we just consider their product and call it Li G M 3 . In the ab- 
sence of occlusions we can lump all the light sources onto one L = j Y^i=i £ ^ 3 
and therefore, again without loss of generality, we can assume that the model of the 
image formation model is given by It(x t ) = p(p)(E + (N p , L)) where x t = 7r(g t p) 
and p G S(xq), xo G D. If we consider a fixed viewpoint, without loss of generality 
we can let g(t) — e so that x(t) = xq = x, and the image-formation model is 

I(x) = [ p{p) p{p)v% ] 

where A p depends on the scene, and iGl 4 depends on the nuisance. If we collect 
measurements at a number of pixels xi,...,x n , and stack the measurements into a 
column vector I = . . . , I(x n )] T , and so for the matrix A, we can write the 

image-formation model in matrix form as / = AX. Now, since A G R nx4 has rank at 
most 4, without loss of generality we will assume that n = 4. Now, using this notation, 
the question of existence of an illumination invariant can be posed as the existence of 
a function of the image I that does not depend on the nuisance X. In other words, <j) 
has to satisfy the following condition 

<t>{I) = (/>(AX 1 ) = </>(AX 2 ) VIi,I 2 G M 4 . (B.27) 

Now, if such a function existed, then, because AX = AHH~ 1 X = BX for any 
H G GL(4), and the matrix B G R 4x4 can be arbitrary, we would have that <j){AX) = 
4>(BX) — (j)(BX) for all X, and therefore the function (j) would not depend on A, 
hence it would be non-discriminative. 

Remark 35 (Invariance to ambient illumination) Notice that if the vector X G M 4 
in the previous proof was instead a scalar (for instance, if X = Eq and L = 0, i.e. 
there is only ambient illumination), then it is possible to find a function of I that does 
not depend on Eq, simply 4>{I) = I(x\)/I{x2) or any other normalizing ratio. See 
Remarks 34 for more details on the illumination model implied by a Lambertian plane 
and its invariance. Note, however, that the ambient illumination term depends on the 
global geometry of the scene, which determines cast shadows. It may be possible to 
show that invariance to illumination can be achieved. 

B.3.1 Lighting-shape ambiguity for Lambertian scenes 

In a series of recent papers [215], Yuille and coworkers have shown that for Lamber- 
tian objects viewed under point light sources from an arbitrary viewpoint there is an 
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important class of ambiguities for 3-D reconstruction. According to [215], given a 
certain scene, seen from a collection of viewpoints under a certain light configuration, 
there exists a different scene, a different collection of viewpoints and a different light 
configuration that generates exactly the same images, and therefore from these it is not 
possible to reconstruct the scene geometry (shape), photometry (albedo, illumination), 
and dynamics (viewpoints) uniquely. In particular, equivalent scenes are character- 
ized by a global affine transformation of the surface normals, the viewpoints, and the 
lighting configuration. 

Yuille's results pertain to a Lambertian scene viewed under ideal point light sources 
from a weak perspective (affine) camera. In [197], however, it is argued that a more 
realistic model of illumination in real- world scenes is not a collection of point sources, 
but an ambient illumination term, which captures the collective effect of mutual illumi- 
nation, and a collection of isolated sources. Here we show that, under this illumination 
model, the KGBR ambiguity described in [215] disappears, and one is left with the 
usual projective reconstruction ambiguity well-known in structure-from-motion [125]. 
Note that Kriegman and co-workers showed that mutual illumination causes the KGBR 
to disappear (CVPR 2005). 

To introduce the problem, we follow the notation of [197], where the basic model 
of image formation for a Lambertian scene is given by 

It(x t ) = p(p) (e (p) + J^N^L^ (B.28) 

where p(p) : S —> R + is the diffuse albedo, N is the number of light sources, N p G S 2 
is the outward normal unit vector to the scene at p G S, Li G S 2 is a collection 
of positions on the sphere at infinity which denotes the ideal point light sources, and 
Eo (p) is the ambient illumination term. Since the ambient light L is assumed to be the 
sphere at infinity, in the absence of self-occlusions and cast shadows (i.e. for convex 
objects) this term is constant. Otherwise, it is a constant modulated by a solid angle 
that determines the visibility of the ambient sphere from the point p: 

Eo(p) = E I (N p , Z)dfi(Z) = E n(p). (B.29) 

Js 2 

In the presence of multiple viewpoints the different images are obtained from the imag- 
ing equation by the correspondence equation that establishes the relationship between 
x and p: 

x t = Tr(g t p), peS, g t e SE(3) (B.30) 

where tt : R 3 — >> R 2 is the ideal perspective projection map. If we rewrite the right 
hand-side of the imaging equation as It(x t ) = p(p)l(p), it is immediate to see that 
we cannot distinguish the radiance p(p) and l(p) from p(p) = p{p)a{p) and l(p) = 
a~ 1 (p)l(p), where a : S — » R + is a positive function whose constraints are to be 
determined. In the absence of the ambient illumination term, Eq, Yuille et al. showed 
that a(p) — \K T N p \. The following theorem establishes that a(p) = p, the identity 
map, and therefore the "ambient albedo" p(p)Eo(p) is an illumination invariant, up to 
affine transformations of the radiance, which can be factored out using level lines or 
other contrast-invariant statistics. 
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Theorem 20 (KGBR disappears with ambient illumination) Let a collection of im- 
age of a scene with surface S, albedo p, viewed under an ambient illumination Eq and 
a number N of point light sources Li, . . . , L^, be given: It(x), t = 1, . . . , T. The 
only other scene that generates the same images consists of a projective transformation 
ofS and the corresponding cameras tt. In the presence of calibrated cameras, the only 
ambiguity is a scale factor affecting S and the translational component of g t . 

The sketch of the proof by contradiction is as follows. Under the assumption of weak 
perspective projection (affine camera), changes in the viewpoint g t can be represented 
by an affine transformation of the scene, K t p, p G S. Dropping the subscript t, we can 
write the image of the scene from an arbitrary viewpoint as 



Ox) = P(KP) (e (Kp) + f2 ^k^n^ j = < B - 31 > 
0^ \E {Kp)\K-^N p \ +f2(K- 1 N p ,E l ) ] (B.32) 



1=1 / 

from which it can be seen that p, Eq and Ei are indistinguishable from 

*> = M;i 

E (p) = EoiKp^K-'Npl (B.34) 
Ei = KEi. (B.35) 

However, the function Eo{p) cannot be arbitrary since, from (B.29) 

E (p) = E n(p) = E Sl(Kp)\K- 1 N p \. (B.36) 

From this equation one concludes that the ratio of the solid angles = \K~ X N p \ is 

a function of the tangent plane at p, N p , which is not the case, hence the contradiction. 



Appendix C 

Tutorial material 



C.l Shape Spaces and Canonization 

Shape Spaces were first introduced to suit the need of comparing rocks [101]. We will 
start even simpler, by comparing triangles on the plane [103]. 



C.l.l Finite-dimensional shape spaces 

Each triangle can be described by the coordinates of its three vertices, xi,^,^ Gl 2 , 
or equivalently by a 2 x 3 matrix x = [xi, x<i, x%] G R 2x3 ~ R 6 . Therefore, a triangle 
can be thought of as a "point" in six-dimensional space X = R 6 . However, depending 
on the reference frame with respect to which the coordinates are expressed, we have 
different coordinates x G X. Indeed, if we "move" the triangle around the plane, its 
coordinates will describe a trajectory in X, and yet we want to capture the fact that it 
is the same triangle. Shape Spaces are designed to capture precisely this concept: The 
shape of a configuration is what is preserved regardless of the choice of coordinates, or 
equivalently regardless of the motion of the object. 

Now, even on the plane, one can consider different kinds of coordinates, or equiva- 
lently different kinds of motions. For the case of triangles, one can consider Euclidean 
coordinates, or correspondingly rigid motions, whereby the coordinates of the trian- 
gles are transformed in such a way as to preserve distances, angles and orientation. In 
this case, the matrix of coordinates x G X is transformed by the multiplication by a 



rotation matrix R G SE{2), that is a matrix of the form R 



cos v — sin t 
sin# cos 

2 



for 



some G [0, 2tt) and the addition of an offset T G R (a translation vector). So if we 
indicate with g = (R, T) G SE(2) the group of rigid motions, and with gx the action 
of the group on the coordinates, we have that x' = gx G R 2x3 and the transformed 
coordinates are x\ = Rxi + T. However, we could also consider the similarity group 
where the rectangles are allowed to be scaled, while retaining the angles, in which case 
g = (aR, T) and points are transformed via x\ = aRxi + T for some a > 0, or the 
affine group where R G GL{2) is an arbitrary 2x2 invertible matrix. In any case, what 
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we want to capture as the "essence" of the triangle x G S is what remains unchanged 
as we change the group, or the reference frame. 

Orbits 

Geometrically, we can think of the group G 3 g acting on the space X by generating 
orbits, that is equivalence classes 

[x] = {gx | g g G}. 

Different groups will generate different orbits. Remember that an equivalence class is 
a set that can be represented by any of its elements, since from any element in the set 
[x] we can construct the entire orbit by moving along the group. As we change the 
group element g, the coordinates gx change, but what remains constant is the entire 
orbit [gx] = [x] for any g G G. Therefore, the entire orbit is the object we are looking 
for: it is the invariant to the group G. We now need an efficient way to represent this 
orbit algebraically, and to compare different orbits. 

Max-out 

The simplest approach consists of using any point along the orbit to represent it. For 
instance, if we have two triangles we simply describe their "shape" by their coordinates 
x, y G R 2x3 . However, when comparing the two triangles we cannot just use any norm 
in R 2x3 , for instance d(x, y) = \\x — y\\, for two identical triangles, written in two 
different reference frames, would have non-zero distance, for instance if y = gx, we 
have d(x, y) = || x — gx || = || e — # || || # || which is non-zero so long as the group g is not 
the identity e. Instead, when comparing two triangles we have to compare all points on 
their two orbits, 

d(x,y) = min \\gix - g x y\\ R e . 

9i,92 

This procedure is called max-out, and entails solving an optimization problem every 
time we have to compute a distance. 

Canonization 

As an alternative, since we can represent any orbit with one of its elements, if we 
can find a consistent way to choose a representative of each orbit, perhaps we can 
then simply compare such representatives, instead of having to search for the shortest 
distance along the two orbits. The choice of a consistent representative, also known 
as a canonical element, is called canonization. The choice of canonization is not 
unique, and it depends on the group. The important thing is that it must depend on 
each orbit independently of others. So, if we have two orbits [x] and [y], the canon- 
ical representative of [x], call it x, cannot depend on [y\. To gain some intuition of 
the canonization process, consider again the case of triangles. If we take one of its 
vertices, for instance x\ to be the origin, so that x\ = 0, or equivalently T = —x\, 
and transform the other points accordingly, we have that all triangles are now of the 
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form [0, X2 — xi,xs — x\] = [0, x' 2 , x' 3 ]. What we have done is to canonize the trans- 
lation group. The result is that we have eliminated two degrees of freedom, and now 
every rectangle can be represented in a translation-invariant manner by a 2 x 2 matrix 

[4,4] eR 2 * 2 . 

We can now repeat the procedure to canonize the rotation group, for instance by 
applying the rotation R{9) that brings the point x 2 to coincide with the horizontal axis. 
By doing so, we have canonized the rotation group. We can also canonize the scale 
group, by multiplying by a scale factor a so that the point x 2 has coordinates [1, 0] T . By 
doing so, we have canonized the scale group, and now every triangle is represented by 
1 i 



o o « RTx * 



Now every triangle is represented by a two-dimensional 

vector =^R T (xs — x\) G M 2 . With this procedure, we have canonized the similarity 
group. If we now want to compare triangles, we can just compare their canonical 
representative, without solving an optimization problem: 

d(x,y) = \\x-y\\ R 2. 

This is a so-called cordal distance; we will describe the more appropriate notion of 
geodesic distance later. 

Note that choosing a canonical representative of the orbit x is done by choosing a 
canonical element of the group, that depends on x, g = g(x), and then un-doing the 
group action, 

x = g~ 1 (x)x. 

It is easy to show, and left as an exercise, that x is now invariant to the group, in the 
sense that g'x — x for any g' G G. This procedure is very general, and we will 
repeat it in different contexts below. However, note that the larger the group that we 
are quotienting out, the smaller the quotient, to the point where the quotient collapses 
into a single point. Consider for instance the case of triangles where we try to canonize 
the affine group. By doing so all triangles would become identical, since it is always 
possible to transform a triangle affinely into another. 

Geometrically, the canonization process corresponds to choosing a base of the orbit 
space, or computing the quotient of the space X with respect to the group G. Conse- 
quently, the base space is often referred to as X/G. Note that the canonical repre- 
sentative x lives in the base space that has a dimension equal to the dimension of X 
minus the dimension of the group G. So, the quotient of triangles relative to the trans- 
lation group is 4-dimensional, relative to the group of rigid motion it is 3-dimensional, 
relative to the similarity group it is 2-dimensional, and relative to the affine group it 
is 0-dimensional. By a similar procedure one could consider the quotient of the set 
of quadrilaterals x = [^1,^2,^3,^4] £ M 2x4 , an 8-dimensional set, with respect to 
the various groups. In this case, the quotient with respect to the affine group is an 
8 — 6 = 2-dimensional space. However, the quotient of quadrilaterals with respect to 
the projective group is 0-dimensional, as all convex quadrilaterals can be mapped onto 
one another by a projective transformation. 

One could continue the construction for an arbitrary number of points on the plane, 
and quotient out the various group: translation, Euclidean, similarity, affine, projective, 
. . . where does it stop? Unfortunately, the next stop is infinite-dimensional, the group of 
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diffeomorphisms [177], and a diffeomorphism can send any finite collection of points 
to arbitrary locations on the plane. Therefore, just like affine transformations for tri- 
angles, and projective transformations for quadrilaterals, the quotient with respect to 
diffeomorphisms collapses any arbitrary (finite) collections of N points into one ele- 
ment of R N . However, as we will see, there are infinite-dimensional spaces that are 
not collapsed into a point by the diffeomorphic group [177]. 

Distributions in the quotient space and non-linearity of the quotient 

As we have seen, the canonization procedure enables us to reduce the dimension of the 
space by the dimension of the group. For instance, triangles live in an 8-dimensional 
space, but once we mod-out similarity transformations, they can be represented by 
x G R 2 . That is, the canonical representatives can be displayed on a planar plot. 
Sometimes, this simplifies the analysis or the visualization of data. 

For instance, consider two collections of random triangles: One is made of isosceles 
triangles, one is made of scalenes. If visualized as triangles, it is very difficult to 
separate them. Visualizing them in their native 8-dimensional space is obviously a 
challenge. However, if we visualize the quotient, their structure emerges clearly. 

Of course, the mod-out operation (canonization) alters the geometry of the space. 
For instance, triangles belong to the linear space R 2 x 3 ~ R 6 . In that space, one can 
sum triangles, multiply them by a scalar, and still obtain triangles. In other words, X 
is a linear space. However, the quotient X/G is not necessarily a linear space, in the 
sense that summing or scaling canonical representative may not yield a valid canonical 
representative. Indeed, the quotient space X/G is in general a homogeneous space 
that is non-linear (curved) even when the native space X and the group G are linear. 
Therefore, when considering a distance in the base space as we have done above, one 
should in principle choose a geodesic distance (the length of the shortest path between 
two points that remains in the space) as opposed to a cordal distance that is the distance 
in the embedding space as we have done above. 

Even when, by a stroke of luck, the quotient is linear, the canonization procedure 
significantly distorts the original space. Consider in fact a collection of triangles that is 
represented by a Gaussian distribution in the space X = R 6 . Once we canonize each of 
them with respect to the Similarity group, we have a distribution in the quotient space 
X/G. This is not Gaussian, but rather part of what are known as Procrustean statistic 
[102]. 

Not all canonizations are created equal 

It is important to notice that the canonization mechanism is not unique. To canonize 
translation, instead of choosing x\, we could have chosen X2, or xs, or any combination 
of them, for instance the centroid. Similarly, for rotation we fixed the direction of the 
segment X1X2, but we could have chosen the principal direction (the singular vector of 
the matrix x corresponding to its largest singular value). 

In principle, no canonization mechanism is better than the other, in the sense that 
they all achieve the goal of quotienting-out the group. However, consider two tri- 
angles that are equivalent under the similarity group (i.e., they can be transformed 
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into one another by a similarity transformation), but where the order of the three ver- 
tices in x is scrambled: x = [#i,#2> #3] an d x = [#2>#i>#3]- Once we follow the 
canonization procedure above, we will get two canonical representatives x ^ x that 
are different. What happens is that we have eliminated the similarity group, but not 
the permutation group. So, we should consider not one canonical representative, but 
6 of them, corresponding to all possible reorderings of the vertices. One can easily 
see how this procedure becomes unworkable when we have large collection of points, 
x G M? xN ,N » 2. 

If, however, we had canonized translation using the centroid, rotation using the 
principal direction (singular vector corresponding to the largest singular value), and 
scale using the largest singular value, then we would only have to consider symmetries 
relative to the principal direction, so that choice of canonization mechanism is more 
desirable. 

Structural stability of the canonization mechanism 

Requiring that the canonization mechanism be unique is rather stiff. Geometrically, it 
corresponds to requiring that the homogeneous space X/G admits a global coordinate 
chart, which is in general not possible, and one has instead to be content with an atlas 
of local coordinate charts. 

However, what is desirable is to make sure that, as we travel smoothly along an 
orbit [x] via the action of a group g, the canonical representative x = x(gx) does not 
all of a sudden "jump" to another chart. 

Consider, again, the example of triangles. Suppose that we choose as a canonical 
representative for translation the point that has the smallest abscissa (the "left-most" 
point). As we rotate the triangle around, the canonical representative switches, which 
is undesirable. A more "stable" canonization mechanism is to choose the centroid 
as canonical representative, as it is invariant to rotations. The notion of "structural 
stability" of the canonization mechanism is very important, and involves the relation 
between the group that is being canonized and all the other nuisances (which may or 
may not be groups). The design of a suitable canonization mechanism should take such 
an issue into account. 

Of course, unless the quotient X/ G admits a global chart, we can expect that as we 
move along the base space there will be switchings between charts. 

C.1.2 Infinite-dimensional shape spaces 

The general intuition behind the process of eliminating the effects of the group G from 
the space X is not restricted to finite-dimensional space, nor to finite-dimensional 
groups. Of course, when the dimension of the group is equal to or larger than the 
space, the quotient collapses to a point, as we have seen for triangles under the affine 
group, quadrilateral under the projective group, and arbitrary collections of points 
under the diffeomorphic group. We now show how to mod-out finite-dimensional 
groups from infinite-dimensional spaces, and then infinite-dimensional groups from 
infinite-dimensional spaces. When we talk about infinite-dimensional spaces we re- 
fer to function spaces, that are characterized by a (finite-dimensional) domain X, a 
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finite-dimensional range, and a map from the former to the latter. 

As an example, we will consider images as elements of the function space X that 
maps the plane onto the positive reals, I : R 2 — » R. 

Transformations of the range of a function (left action) 

In the previous section we have considered affine transformations of R 2 . We now 
consider affine transformation of R, and apply them to the range of the function / : 
R 2 — > R; x ^ I(x). For simplicity we assume that / is smooth, defined on a compact 
subset D C R 2 , and has a bounded range. An affine transformation is defined by 
two scalars a, /?, with and transforms the range of the function /(•) via g o I = 

al+fi. Therefore, the orbits we consider are of the form [I] = {al+f3, a > 0, f3 G R}, 
and the function g o I is defined by g o I(x) = al(x) + /?. 

As in the finite-dimensional case, there are several possible canonization mecha- 
nisms. The simplest consists in choosing the canonical representative of f3 to be the 
smallest value taken by I, fj = min^ dcm 2 H x ) an d me canonical representative of 
a to be the largest value a = mdiX xeDcR 2 1{x). However, one could also choose 
the mean for (3 and the standard deviation for a. This is no different than if / was 
an element of a finite-dimensional vector space. In either case, the canonical group 
element {&, $} = g(I) is determined from the function /, and is then "un-done" via 

g~ x o I = Again, we have that the canonical element is J = g~ l (I) o /. 

More interesting is the case when the group acting on the range is infinite-dimensional. 
Consider for instance all contrast functions , that is functions k : R — >> R that are con- 
tinuous and monotonic. These form a group, and indeed an infinite-dimensional one. 
The equivalence class we consider is now 

[/] ={h/,kH}, 

and go I{x) = k(I(x)), where H is the set of contrast transformations. 

A canonization procedure is equivalent to a "dynamic time warping" [104] of the 
range of the function 1 that is chosen in a way that depends on the function itself. The 
affine range transformation was a very special case. It has been shown in [ ] that the 
quotient of real- valued function with respect to contrast transformations is the equiva- 
lence class of iso-contours of the function. So, by substituting the value of each pixel 
with the curvature of the iso-contour curves, one has effectively canonized contrast 
transformations. Equivalently, because the iso-contours are normal to the gradient di- 
rection, one can canonize contrast transformations by considering, instead of the values 
of /, the direction of the gradient. This explains the popularity of the use of gradient 
direction histograms in image analysis. 

Note that in all these cases, the canonical element of the group g = {a, j3} or 
g — fc(-), is chosen in a way that depends only on the function /(•) in question, so we 
can write the canonical element as g = g(I), and, as usual, we have 

I = g-\I)°I- 

lr The name is misleading, because that there is nothing dynamic about dynamic time warping, and there 
is no time involved. 
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Transformations of the domain of a function (right action) 

We have already seen how to canonize finite-dimensional groups of the plane. The 
orbits we consider are of the form / o g(x) = I(gx). So, if we want to canonize 
domain transformations of the function I, that is if we want to represent the quotient 
space of the equivalence class 

[I] = {Iog,geG} : 

we need to find canonical elements g that depend on the function itself. In other words, 
we look for canonical elements g = g(I). As a simple example, we could canonize 
translation by choosing the highest value of /, and rotation using the principal cur- 
vatures of the function / at the maximum. However, instead of the maximum of the 
function / we could choose the maximum of any operator (functional) acting on /, for 
instance the Hessian of Gaussian V 2 G * I{x) = J D V 2 G(x — y)I(y)dy, where D 
is a neighborhood around the extremum. Similarly, instead of choosing the principal 
curvature of the function /, we could choose the principal directions of the second- 
moment matrix J D VI T VI(x)dx. In either case, once we have a canonical represen- 
tative for translation and rotation, we have g, and everything proceeds just like in the 
finite-dimensional case. 

More interesting is the case when the group g is infinite-dimensional, for instance 
planar diffeomorphisms w : R 2 —> R 2 . In this case we consider the orbit [I] = 
{Iog,g e W} where the function / o g(x) = I(w(x)). It has been shown in [197] that 
this is possible, although we defer to a discussion of [177] below for a characterization 
of the quotient. 

In all the cases above, the canonization process consists in first determining a group 
element g that depends on the function, g = g(I), and then "undoing" the group, as 
usual, via 

i = Iog~\l). 

Joint domain and range transformations (left and right actions) 

So far we have considered functions with groups acting either on the range go I(x) = 
k(I(x)) or on the domain I o g(x) = I(w(x)). However, there is nothing that prevents 
us from considering groups acting simultaneously on the domain and range. In this 
case, we consider orbits of the kind 

[I] = idi 1 92, 9i £ Gi, g 2 e G 2 }. 

The canonization mechanism is the same, leading to g\ (J), §2(1), from which we can 
obtain the canonical element 

I = g^(I)oIog-\l). 

C.1.3 Commutativity 

When there is more than one group involved in the canonization process, one has to 
exercise attention to guarantee that the two canonizations commute, so that if we first 
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canonize the domain, and then the range, we get the same result as if we first canonize 
the range, and then the domain. In other words, g\{I) = gi(I o g 2 ) for any g 2 G G 2 , 
and vice- versa #2 CO = #2(01 I) for any g\ G G\. 

For the case of images / : R 2 — >• R subject to diffeomorphic domain deformations, 
g 2 G >V and contrast transformations #i G 7~L, it has been shown in [177] that the two 
canonization processes commute, and that the quotient is a discrete (0-dimensional) 
topological object known as the Attributed Reeb Tree (ART). It is the collection of ex- 
trema, their nature (maxima, minima, saddles), and their order - but not their value - 
and their connectivity (in the sense of the Morse-Smale Complex), but not their posi- 
tion. 

C.1.4 Covariant detectors and invariant descriptors 

The functional g(I) that chooses the canonical element of the group is also called a co- 
variant detector, in the sense that it varies with the group. Once a co- variant detector 
has been determined, the canonical representative / = / o is also called an 

invariant descriptor, in the sense that - as we have already seen - it is a representative 
of the entire orbit [I] and therefore invariant to G. 

C.2 Lambert- Ambient model and its shape space 

We consider an object of interest that is static and Lambertian, so it can be described by 
its geometry, a surface S : D C R 2 — » R 3 ; xq \-> S(xq) and its photometry, the albedo 
p : S —> R + ;p 1 — y p(p). We assume an ambient illumination model that modulates 
the albedo with a simple contrast transformation k : M —> M; / \-> k{I). The scene 
is viewed from a vantage point determined by g G SE(3), so that the point p projects 
onto the pixel x — ir(gp). In the absence of occlusions, regardless of the shape of S, 
the map from x to x is a homeomorphism, x = 7r(gS(xo)) = w~ 1 (xq)\ the choice 
of name w~ x is to highlight the fact that it is invertible. If we assume (without loss 
of generality given the visibility assumption) that p is the radial graph of a function 
Z : R 2 — >> R (the range map), so that is p = xZ(x), where x = [x, y, l] T are the 
homogeneous coordinates of x G R 2 , we have that 

w(x) _ jeie 2 ] T g-'xZjx) , _ [e 1 e 2 ] T gS{x) 

where are the i-th coordinate vectors. Putting all the elements together we have a 
model that is valid under assumptions of Lambertian reflection, ambient illumination, 
and co- visibility: 

I(x) = ko po S ow(x) +n(x), xeD. (C.l) 

In relating this model to the discussion above on canonization, a few considerations are 
in order: 

• There is an additive term n, that collects the results of all unmodeled uncer- 
tainty. Therefore, one has to require not only that left- and right- canonization 
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commute, but also that the canonization process be structurally stable with re- 
spect to noise. If we are canonizing the group g (either or wj), we cannot expect 
that g(I) = g(I — n), but we want g do depend continuously on n, and not to 
exhibit jumps, singularities, bifurcations and other topological accidents. This 
goes to the notion of structural stability and proper sampling 

• If we neglect the "noise" term n, we can think of the image as a point on the 
orbit of the "scene" p o S. Because the two are entangled, in the absence of 
additional information we cannot determine either p or S, but only their compo- 
sition. This means that if we canonize contrast k and domain diffeomorphisms 
w, we obtain an invariant descriptor that lumps into the same equivalence class 
all objects that are homeomorphically equivalent to one another [ ]. The fact 
that w(x) depends on the scene S (through the function Z) shows that when we 
canonize viewpoint g we lose the ability to discriminate objects by their shape 
(although see later on occlusions and occluding boundaries). Thus, with an abuse 
of notation, we indicate with p the composition po S. 

We now show that the planar isometric group SE(2) can be isolated from the diffeo- 
morphism w, in the sense that 



w(x) = w o g(x) = w(gx) 



(C.2) 



for a planar rigid motion g G SE(2) and a residual planar diffeomorphism w, in 
such a way that the residual diffeomorphism w can be made (locally) independent 
of planar translations and rotations. More specifically, if the spatial rigid motion 
(i2, T) G SE(3) has a rotational component R that we represent using Euler angles 
for cyclo -rotation (rotation about the optical axis), and lji,lj2 for rotation about the 
two image coordinate axes, and translational component T = [Ti, T 2 , T 3 ] T , then the 
residual diffeomorphism w(x) — w(x) can be made locally independent of Ti, T2 and 
0. To see that, note that Ri{0) = exp(e 3 #) is the in-plane rotation, and R (uj\ , 0J2) = 
exp(e 2 cj2) exp(eio;i) is the out-of-plane rotation, so that R = RiR Q - In particular, 



Ri 



Ri{0) 




where R x {0) = 



cos f 
sin£ 



— smf 

COS0 



We write R n in blocks as 



R 



R2 



T3 

^5 



where R2 G 



p2x2 



and G R. We can then state the claim: 



Theorem 21 The diffeomorphism w : R 2 — » R 2 corresponding to a vantage point 
(R,T) G SE(3) can be decomposed according to (C.2) into a planar isometry g G 
SE(2) and a residual diffeomorphism w : R 2 —> R 2 that can be made invariant to 
R\{0) and arbitrarily insensitive to Ti, T2, in the sense that V e 3 5 such that \\x\\ < 
5^ \\m <efori = 1,2. 



This means that, by canonizing a planar isometry, one can quotient out spatial transla- 
tion parallel to the image plane, and rotation about the optical axis, at least at least in a 
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neighborhood of the origin. 

Proof: We write the diffeomorphism explicitly as 

^ [eie 2 ] T (RiR xZ(x) + T) _ R 2 Ri (O)xZ(x) + r 3 Z(x) + [T l5 T 2 ] q 



vlxZ(x) + r 5 + T 3 rJxZ(x) + r 5 + T 3 

define the disparity d(x) = 1/Z(x), so the above expression becomes 



(C.3) 



V ; rjx + (r 5 + T 3 )d(x) 

We can now apply a planar isometric transformation g G SE(2) defined in such a way 
that w{x) = w o g _1 {x) satisfies w(0) = 0, and w(x) does not depend on R\{6). To 
this end, ifg = (R, T), we note that 

~, x • .-i, v R2Ri(0)R T (x - f) + r 3 + [Tl T 2 ] r d>) 
w(x) = w o q (x) = = (C.5) 

a«d =do g'^x) = d(R T (x - T)) is an unknown function, just like d was. We 
now see that imposing 2 

fl = i?i(0) and f = R^ 1 [T u T 2 ] T d(0) (C.6) 

w£ /zav£ that the residual diffeomorphism is given by 

= ^ + [T 1 ,T 2 F(J(,)-d ) j 
r^ + (r 5 +T3)rf(x) 

Note that w does not depend on R\{0); because of the assumption on visibility and 
piecewise smoothness of the scene, d is a continuous function, and therefore the func- 
tion d(x) — do in a neighborhood of the origin x = can be made arbitrarily small; 
such a function is precisely the derivative ofw with respect to T\,T 2 . 
In the limit where the neighborhood is infinitesimal, or when the scene is fronto- 
parallel, so that d{x) = const., we have that 

w(x)* , y xeBM- (C8) 

t\x + (r 5 +T 3 )d(x) 

A canonization mechanism can be designed to choose R, for instance so that the or- 
dinate axis of the image is aligned with the projection of the gravity vector onto the 
image, and to choose T, for instance by choosing as origin of the image plane some 
statistic of the image, such as the location of an extremum, or the extremum of the 
response to a linear operator. Such operators are called "co- variant detectors." 



2 It may seem confusing that the definition of T is "recursive" in the sense that f = 
R~ 1 [T 1 ,T 2 ] T d(x) = R~ 1 [T 1 ,T 2 ] T d(R T (x-f)). This, however, is not a problem because f is chosen 
not by solving this equation, but by an independent mechanism of imposing that w(0) = 0, that is called a 
"translation co-variant detector." 
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The consequence of this theorem, and the previous observation that S and p cannot 
be disentangled, are that we can model the Lambert- Ambient model (C.l) as 

I(x) = k o p o w o g(x) + n(x). (C.9) 

In the absence of noise n, the canonization process would enable us to mod-out fc, w 
and g, and would yield a canonical element / that belongs to the equivalence class [p] 
under viewpoint and contrast transformations. This is the ART introduced in [177]. 

In the presence of noise, the group g acts linearly on the image, in the sense that 
(I i + I2) g = h 9 + I2 9- So, the canonization process effectively eliminates the 
dependency on g: 

I o = k o p o w(x) + n(x) (C.10) 

where n(x) = o^ _1 (x). Because g is an isometry, n will be a transformed realization 
of a random process that has the same statistical properties (e.g. mean and covariance) 
of n. Although w also acts linearly on the image, h o w -1 does not have the same 
statistical properties of h, because the diffeomorphism w alters the distribution of h. 
Therefore, the canonization process ofw does not commute with the additive noise and 
cannot be performed in an exact fashion. 

Similarly, the general contrast transformation k does not act linearly on the image, 
in the sense that k~ l o (Ii + I2) 7^ k~ x o Ii + o I 2 . Similarly to what we have 
done for w, we can isolate the affine component of k, that is the contrast transformation 
I 1-^ olI + (3, and canonize that. For simplicity, we just assume that k is not a general 
contrast transformation, but instead an affine contrast transformation. By canonizing 
that we have 

k' 1 o I o g~ r (x) = p o w(x) + n'(x) (C.ll) 

where now n'(x) has a statistical description that can be easily derived as a function 
of the statistical description of n and the values of a, fj in the contrast canonization (if 
p and a are the mean and standard deviation of n' , then (/i — /3)/a and a /a are the 
mean and standard deviation of n'). If we summarize the canonization process as a 
functional <p acting on the image, and forgo all superscripts and tildes, we have 

<t>{I{x)) = p o w(x) + n(x). (C.12) 

When the noise is "small" one can think of <p(I) as a small perturbation of a point 
on the base space of the orbit space of equivalence classes [p o w] under the action of 
planar isometries and affine contrast transformations. 

C.2.1 Occlusions 

In the presence of occlusions, including self-occlusions, the map w is not, in general, 
a diffeomorphism. Indeed, it is not even a map, in the sense that for several locations 
in the image, x G O, it is not possible to find any transformation w(x) that maps the 
radiance p onto the image /. In other words, if D is the image-domain, we only have 
that 

I(x) = ko pow(x), xeD\n. (C.13) 
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The image in the occluded region Q can take arbitrary values that are unrelated to 
the radiance p in the vicinity of the point S(w(x)) G M 3 ; we call these values f3(x). 
Therefore, neglecting the role of the additive noise n{x) for now, we have 

I(x) = kopo w (x)(l - xn{x)) + P(x)xn(x), x G D. (C.14) 

The canonization mechanism acts on the image /, and has no knowledge of the oc- 
cluded occluded region Q. Therefore, may eliminate the effects of the nuisances 
k and w, if it based on the values of the image in the visible region, or it may not - if 
it is based on the values of the image in an occluded region. If (f)(1) is computed in a 
region R T B a (x — T), then the canonization mechanism is successful if 

R T B a (x-f) c D\Q. (C.15) 

And fails otherwise. Whether the canonization process succeeds or fail can only be 
determined by comparing the statistics of the canonized image (f)(1) with the statistics 
of the radiance, p, which is of course unknown. However, under the Lambertian as- 
sumption, this can be achieved by comparing the canonical representation of different 
images. 

Determining co-visibility 

If range maps were available, one could test for co- visibility as follows. Let Z : D C 
R 2 — >> R + ; x ^ Z(x; S) be defined as the distance of the point of first intersection of 
the line through x with the surface x: 

Z(x; S) = min{Z > | xZ G S}. (C.16) 

When the surface is moved, the range map changes, not necessarily in a smooth way 
because of self-occlusions: 

Z(x; gS) = min{Z > | gxZ G S}. (C.17) 

A point with coordinates xq on an image is co-visible with a point with coordinates x 
in another image taken by a camera that has moved by g G SE(3) if 

xZ(x]gS)=gx Z(x ]S) (C.18) 

or, equivalently, g~ x xZ(x\S) = xqZ(xq] g~ x S). An alternative expression can be 
written using the third component of the equation above, that is 

Z(x;gS) = eJgS(x ) (C.19) 

Therefore, the visible domain D\Q is given by the set of points x that are co- visible 
with any point xo G D. Vice- versa, the occluded domain is given by points that are not 
visible, i.e. 



n = {x G D I Z(x- gS) + eJgx Z(x ; S), x eD}. 



(C.20) 
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C.2.2 Summary 

The analysis above allows us to describe the shape spaces of the Lambert- Ambient 
model. Specifically: 

• In the absence of noise, n = and occlusions Q — 0, the shape space of the 
Lambert- Ambient model is the set of Attributed Reeb Trees [177]. 

• In the presence of additive noise n / 0, but no occlusions, ft = 0, the shape 
space is the collection of radiance functions composed with domain diffeomor- 
phisms with a fixed point (e.g. the origin) and a fixed direction (e.g. gravity). 

• In the presence of noise and occlusions, the shape spaces is broken into local 
patches that are the domain of attractions of covariant detectors. The size of the 
region depends on scale and visibility and cannot be determined from one da- 
tum only. Co- visibility must be tested as part of the correspondence process, by 
testing for geometric and topological consistency, as well as photometric consis- 
tency. 
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Appendix D 

Legenda 



Symbols are defined in the order in which they are introduced in the text. 

• /: an image, either intended as an array of positive numbers {Iij G 

or a map defined on the lattice 1? with values in the positive reals, / : 1? — >> 
R + ; (i, j) ^ or in the continuum as a map from the real plane, or a 

subset D of the real plane, / : D C M 2 — » M + ; x H> For a time- varying 

image, or a video, this is interpreted as a map / : D C R —> M + ; H> 
t) or sometimes It(x). The temporal domain can be continuous t G R, or 
discrete, £ G Z. For color images, the map / takes values in the sphere S 2 CM 3 , 
represented with three normalized coordinates (RGB, or YUV, or HSV), and 
more in general, / can take values in R k for multi- spectral sensors. 

• D: the domain of the image, usually a compact subset of the real plane or of the 
lattice, D cM 2 orDcZ 2 . 

• h: Symbolic representation of the image formation process. 

• £: symbolic representation of the "scene". S, the space of all possible scenes. 

• p, S: symbolic representation of the geometry of the scene. S C M 3 is a piece- 
wise smooth, multiply-connected surface, and p G S the generic point on the 
surface. 

• p: a symbolic representation of the reflectance of the scene, p : S —> M + 
represents the diffuse albedo of the surface S, or the radiance if no explicit illu- 
mination model is provided. 

• g G G: motion group. If G = SE(3), g represents a Euclidean (rigid body) 
transformation, composed of a rotation R G 50(3) (an orthogonal matrix with 
positive determinant) and a translation T G M 3 . When g is indexed by time, we 
write g(t) or g t . 

• v\ Symbolic representation of the nuisances in the image formation process. 
They could represent viewpoint, contrast transformations, occlusions, quantiza- 
tion, sensor noise. 
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• m: Contrast transformation, m : R — >> R is continuous and monotonic. Usually 
the arguments are normalized so that m : [0, 1] — >> [0, 1]. 

• M the space of contrast transformations. 

• n: Noise process, including all unmodeled phenomena in the image formation 
process. 

• SE(3): Special Euclidean group of rigid body motions in three-dimensional 
Euclidean space. 



• 7r: perspective projection: it : 

%i = X 1 /X 3 andx 2 = X 2 /X 3 . 

• Homogeneous coordinates x = 



P2. 



; X H> tt(X) = x where 



• w: Domain diffeomorphism, w : D C 
differentiable inverse. 

• Z: Depth map, Z:Dcl + ;x4 



that is differentiable, with 



• tt s 1 (x) = {p e S \ 7r(p) — x}: Pre-image of the pixel x, the intersection of the 
projection ray x with every surface on the scene. 

• tt~ 1 (x): Pre-image of the pixel x, the intersection of the projection ray x with 
the closest point on the scene. 

• R,T: Rotational and translational component of the motion group g G G, usu- 
ally indicated with g = (R, T), or in homogeneous coordinates as a 4 x 4 matrix 



G 



R T 
1 



• Q: the occluded domain, a subset of D, that back-projects onto a portion of the 
scene that is visible from the current image, I(x, t), but not from neighboring 
images I(x, t + 1) or I(x, t — 1). 

• c: Class label, without loss of generality assumed to be a positive integer, or 
simply c G {0, 1} for binary classification. 

• R(-): a risk functional. Not to be confused with R G 50(3), a rotation matrix. 

• J\f\ Normal (Gaussian) density function. 

• dfi\ Base (Haar) measure on a space. 

• dP, dQ: Probability measures. When these are (Radon-Nikodym) differentiable 
with respect to the base measure, we have dP = pdfi and dQ = qdfi where q 
are probability density functions. 



b\ A feature, i.e. any statistic, or deterministic function of the data. 
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• £: A representation. 

• H : a complexity measure, for instance coding length, or algorithmic complexity, 
or entropy if the image is thought of as a distribution of pixels. 

• H: Actionable information 

• : Complete information 

• ip: A feature detector, a particular case of feature that serves to fix the value of a 
particular group element g G G. 

• •): A distance. 

• || • || : A norm. 

• /: Quotient. 

• \: Set difference. 

• E [•] , E p [•] : Expectation, expectation with respect to a probability measure, E p [f] = 

jfdp. 

• ~: Similarity operator, x ~ y, denoting the existence of an equivalence relation, 
and x, y belonging to the same equivalence class; also used to denote that an 
object is "sampled" from a probability distribution, x ~ dP. 

• [•]: An equivalence class. 

• o: Composition operator 

• V 3: For all, exists. 

• ^: The "hat" operator mapping R 3 to se(3), the set of 3 x 3 skew- symmetric 
matrices. 

• The estimate of an object, for instance x is an estimate of x. 

• | • |: Determinant of a matrix \A\, or volume of a set \Q\, or absolute value of a 
real number \x\, or the Euclidean norm of a vector in R N , |x|. 

• X : A characteristic function. For a set Xn( x ) = 1 if # £ fi, and otherwise. 
Sometimes it is indicated by x(fi) when the independent variable is clear from 
the context. 

• B, B(j{x)\ An open set (a "ball") of radius a centered at x. 

• Q: A set, or a convolution kernel. 

• J: The Jacobian determinant (the determinant of the derivative of a transforma- 
tion). 

• GL(3) the general linear group of 3 x 3 invertible matrices. Similarly, GL(4). 
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• u: The input to a dynamical system, usually denoting a variable that is actively 
controlled and therefore known with high precision. 

• VI: The gradient of the image, consisting of two components V X I and V y I. 

• uj: Either a set, or a rotational velocity vector, depending on the context. 

• The restriction of an image to a set. 

• {•}: A set. 

• X: The set of images. 

• 5: Either Dirac's delta distribution, defined implicitly by J S(x — y)f(y)dy = 
f(x) and J S(x)dx = 1; or Kronecker's delta, S(i,j) = 1 if i = j, and 
otherwise. 

• * convolution operator. For two functions f,g,f*g = j f(x — y)g(y)dy. 

• H 2 the half-sphere in R 3 . 

• £ p \ finite-dimensional spaces of p-convergent sequences, for instance absolutely 
convergent (p = 1) or square- summable (p = 2) sequences. 

• L p : infinite-dimensional spaces of p-integrable functions, for instance Lebesgue- 
measurable functions L 1 or square-integrable functions L 2 . 

• H: The space where the average temporal signal lives, for use in Dynamic Time 
Warping. 

• H 2 : A Sobolev space. 

• ®: The composition operator to update a representation with the innovation. It 
would be a sum of all the elements involved were linear. 

• I: Mutual Information. 



